We have data on a mobile game's users: 30k users' data and ~25 features, along with our response column "churn". Our aim is to analyze the data and use machine learning algorithms to fit a model that can predict the churn behavior of users in a separate test set of 5000 users (disjoint from the 30k we have) whose churn behavior we do not know.
I will do exploratory data analysis, where I will look at the data structure, correlations, and visualizations and try to get insights from the data.
With the insights I get, I will do feature engineering: changing features through data manipulation or generating additional features by doing operations on the features I already have.
Lastly, I will fit models to the different data sets I create and use the best combination of model and data set in terms of accuracy (I will split my training data into train and test sets to calculate this) to make predictions on the actual test data, whose churn behavior we do not know.
If I succeed, this model can be used to predict the churn behavior of users who sign up, so that business actions can be taken accordingly (like sending special campaigns to those we predict will churn). Beyond this, the insights from our exploratory data analysis can be used to improve the game's mechanics and features in ways that decrease the users' churn ratio.
First, we will import our data to R and review it.
x <- read.csv("ml_project_2022_churn_train.csv")
str(x)
## 'data.frame': 30000 obs. of 26 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ X_id : int 88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
## $ os : chr "android" "android" "android" "android" ...
## $ country : chr "IT" "BR" "RO" "LT" ...
## $ device_brand : chr "samsung" "Xiaomi" "HUAWEI" "HUAWEI" ...
## $ device_model : chr "SM-A125F" "Redmi 8" "CLT-L29" "SNE-LX1" ...
## $ session_cnt : int 7 1 1 2 1 3 1 3 1 8 ...
## $ session_length : int 13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
## $ lang : chr "IT" "BR" "RO" "LT" ...
## $ max_lvl_no : int 13 9 28 14 12 20 9 23 9 1 ...
## $ gameplay_duration: int 1088 1898 1685 1221 532 3018 393 747 382 1435 ...
## $ bonus_cnt : int 6 9 8 2 4 23 3 3 0 15 ...
## $ hint1_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : int 2 1 0 0 1 0 0 0 0 0 ...
## $ hint3_cnt : int 0 0 0 0 0 0 0 0 0 2 ...
## $ repeat_cnt : int 7 10 21 7 10 30 4 1 1 38 ...
## $ gold_cnt : int 418 25 990 1360 1367 585 620 2141 665 500 ...
## $ claim_gold_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ banner_cnt : int 17 38 0 49 23 154 16 33 9 14 ...
## $ is_cnt : int 1 0 16 2 0 7 0 11 0 2 ...
## $ rv_cnt : int 0 1 1 1 1 3 0 1 1 4 ...
## $ churn : int 1 1 1 1 1 1 1 1 1 1 ...
## $ multi_lang : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ max_ses_length : int 12451 1809 2141 1248 666 1817 503 449 520 605 ...
## $ median_ses_length: int 100 1809 2141 649 666 1609 503 372 520 193 ...
## $ avg_ses_length : int 1880 1809 2141 649 666 1171 503 379 520 236 ...
We can see that (other than the response variable) our data consists
of 6 categorical and 18 numerical variables (we don't include the first
column X, which is the row number and not really a feature), and we have
30000 observations.
We can remove X from our dataset since it is not a feature.
x <- x[,!names(x) %in% c("X")]
We will start our EDA with the descriptive (univariate) analysis of our
features one by one.
which(is.na(x))
## integer(0)
sum(is.na(x))
## [1] 0
There are no missing values in our data whatsoever. This is good, since we won't have to worry about them.
Our response column is "churn". We calculate the proportion of churners:
sum(x$churn)/30000
## [1] 0.6269333
We find that ~63% of our observations are churners.
To analyze the categorical features, we first convert them from chr or logi to Factor. We also factorize our response column "churn": even though it is stored as numeric, it is actually a binary categorical variable with 2 levels (0 = not churned, 1 = churned).
x <- as.data.frame(unclass(x), stringsAsFactors = TRUE) #factorize chr's
x$multi_lang <- as.factor(x$multi_lang)
x$churn <- as.factor(x$churn) #factorize churn
str(x)
## 'data.frame': 30000 obs. of 25 variables:
## $ X_id : int 88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
## $ os : Factor w/ 1 level "android": 1 1 1 1 1 1 1 1 1 1 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : int 7 1 1 2 1 3 1 3 1 8 ...
## $ session_length : int 13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : int 13 9 28 14 12 20 9 23 9 1 ...
## $ gameplay_duration: int 1088 1898 1685 1221 532 3018 393 747 382 1435 ...
## $ bonus_cnt : int 6 9 8 2 4 23 3 3 0 15 ...
## $ hint1_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : int 2 1 0 0 1 0 0 0 0 0 ...
## $ hint3_cnt : int 0 0 0 0 0 0 0 0 0 2 ...
## $ repeat_cnt : int 7 10 21 7 10 30 4 1 1 38 ...
## $ gold_cnt : int 418 25 990 1360 1367 585 620 2141 665 500 ...
## $ claim_gold_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ banner_cnt : int 17 38 0 49 23 154 16 33 9 14 ...
## $ is_cnt : int 1 0 16 2 0 7 0 11 0 2 ...
## $ rv_cnt : int 0 1 1 1 1 3 0 1 1 4 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : int 12451 1809 2141 1248 666 1817 503 449 520 605 ...
## $ median_ses_length: int 100 1809 2141 649 666 1609 503 372 520 193 ...
## $ avg_ses_length : int 1880 1809 2141 649 666 1171 503 379 520 236 ...
We will draw bar graphs of our categorical variables with respect to churn to see the distributions using ggplot.
library(ggplot2)
categorical_names <- names(Filter(is.factor, x))
for (i in categorical_names) {
  print(ggplot(x, aes(x[,i])) + geom_bar(aes(fill = churn), position = "dodge") + labs(x = i))
}
At first glance, we can see that we have a non-informative variable
(os has only one level), as well as variables with low-entropy or
class-imbalance problems. Now let's analyze them one by one.
OS
The OS column has only one level, which means it is a non-informative
feature with no value whatsoever for predicting churn. We can therefore
remove this variable from our dataset to reduce complexity and
computation cost with no drawbacks.
x <- x[,!names(x) %in% c("os")]
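Dropping os by name works here, but a small helper that flags constant (single-valued) columns generalizes the idea to any non-informative column we might meet later. A minimal sketch on toy data; the function name constant_cols and the data frame below are illustrative, not part of the original analysis:

```r
# Return the names of columns that hold a single unique value (zero variance)
constant_cols <- function(df) {
  names(df)[vapply(df, function(col) length(unique(col)) <= 1, logical(1))]
}

# Toy data: 'os' is constant, the other columns are not
d <- data.frame(os = rep("android", 5),
                session_cnt = c(7, 1, 1, 2, 1),
                churn = c(1, 1, 0, 1, 0))
constant_cols(d)                          # "os"
d <- d[, !names(d) %in% constant_cols(d)] # drop every constant column at once
```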
Country
For many of its levels, the country feature suffers from the low-entropy problem: it has lots of levels that occur in only a few observations. From the graph, we can say that ~15 of its 71 levels cover most of the data.
We can deal with this problem either by (1) combining the categories
with few observations into a single category called "Other", or by
(2) finding another highly correlated categorical feature, such as
language, using only that feature, and removing country altogether.
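Option (1) can be sketched with forcats, which this analysis already uses for plotting: fct_lump_min collapses every level below a minimum count into "Other". The toy factor and the threshold of 5 below are illustrative, not values chosen for the real country column:

```r
library(forcats)

# Toy country factor: two frequent levels, three rare ones
country <- factor(c(rep("IT", 6), rep("BR", 5), "AM", "AZ", "EE"))

# Collapse levels that appear fewer than 5 times into "Other"
country_lumped <- fct_lump_min(country, min = 5, other_level = "Other")
table(country_lumped)   # BR and IT survive; AM, AZ, EE become "Other"
```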
Other than this problem, the churn ratios seem to differ among
countries, which means this variable can give us valuable information
for predicting churn.
Language
Language seems like another version of country, with fewer levels and
less severe low-entropy/class-imbalance problems. Again, ~15 levels out
of 28 (significantly fewer than country) cover most of the
observations.
If country and language are highly correlated, as one would expect, we
can choose one of them and remove the other.
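One quick way to check that redundancy before dropping a column: for each language, compute the share of its rows that fall into its single most common country. Values near 1 mean language almost determines country. A sketch on toy data (the vectors below are made up for illustration):

```r
# Toy columns standing in for x$lang and x$country
lang    <- c("IT", "IT", "IT", "BR", "BR", "DE")
country <- c("IT", "IT", "CH", "BR", "BR", "DE")

# For each language (table row), the proportion held by its dominant country
tab <- table(lang, country)
dominant_share <- apply(prop.table(tab, margin = 1), 1, max)
dominant_share   # values near 1 mean country is nearly redundant given lang
```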
Device Brand
We see the same problems as with country (low entropy and class
imbalance), but to an even greater degree in this feature.
We also do not see any significant difference in churn ratios between
device brands, but we can check this by drawing churn-ratio graphs for
the most frequent levels of this feature.
brand_names_150 <- names(which(table(x$device_brand) > 150))
brand_names_150
## [1] "" "HMD Global" "HUAWEI" "LENOVO" "LGE"
## [6] "motorola" "OnePlus" "OPPO" "realme" "samsung"
## [11] "TCL" "Xiaomi"
library(data.table)
xvdt <- data.table(x)
brand150 <- xvdt[device_brand %in% brand_names_150]
brand150
## X_id country device_brand device_model session_cnt session_length
## 1: 88464659 IT samsung SM-A125F 7 13162
## 2: 88466479 BR Xiaomi Redmi 8 1 1809
## 3: 88468912 RO HUAWEI CLT-L29 1 2141
## 4: 88471351 LT HUAWEI SNE-LX1 2 1298
## 5: 88473161 RO HUAWEI LYA-L29 1 666
## ---
## 29074: 95099191 IT Xiaomi Redmi Note 8T 4 10668
## 29075: 95105634 BG realme RMX3263 2 2104
## 29076: 95111824 BR Xiaomi Mi Note 10 Lite 1 397
## 29077: 95117734 GR samsung SM-A217F 3 1548
## 29078: 95122431 GR samsung SM-G975F 3 390
## lang max_lvl_no gameplay_duration bonus_cnt hint1_cnt hint2_cnt
## 1: IT 13 1088 6 0 2
## 2: BR 9 1898 9 0 1
## 3: RO 28 1685 8 0 0
## 4: LT 14 1221 2 0 0
## 5: RO 12 532 4 0 1
## ---
## 29074: IT 26 8390 52 0 4
## 29075: BU 17 2046 2 0 5
## 29076: BR 4 408 3 0 4
## 29077: GR 20 1248 3 0 1
## 29078: GR 8 310 0 0 0
## hint3_cnt repeat_cnt gold_cnt claim_gold_cnt banner_cnt is_cnt rv_cnt
## 1: 0 7 418 0 17 1 0
## 2: 0 10 25 0 38 0 1
## 3: 0 21 990 0 0 16 1
## 4: 0 7 1360 0 49 2 1
## 5: 0 10 1367 0 23 0 1
## ---
## 29074: 6 48 1585 0 346 38 20
## 29075: 0 12 72 0 64 5 2
## 29076: 0 3 545 0 0 0 1
## 29077: 0 3 115 0 48 5 2
## 29078: 0 1 605 0 8 0 0
## churn multi_lang max_ses_length median_ses_length avg_ses_length
## 1: 1 FALSE 12451 100 1880
## 2: 1 FALSE 1809 1809 1809
## 3: 1 FALSE 2141 2141 2141
## 4: 1 FALSE 1248 649 649
## 5: 1 FALSE 666 666 666
## ---
## 29074: 1 FALSE 3846 2992 2667
## 29075: 0 FALSE 1784 1052 1052
## 29076: 1 FALSE 397 397 397
## 29077: 0 FALSE 621 582 516
## 29078: 0 FALSE 388 2 130
library(ggplot2)
library(forcats) #we will use fct_infreq to plot the brands in descending order by their counts
ggplot(brand150, aes(fct_infreq(device_brand))) + geom_bar(aes(fill = churn), position = "dodge")
ggplot(brand150, aes(fct_infreq(device_brand))) + geom_bar(aes(fill = churn), position = "fill")
We can see that when we only take into account the brands that appear
in more than 150 observations, we reduce the level count from 113 to
12 while overlooking only 922 observations.
We see differentiation between the churn ratios of different brands,
which means this feature may give us valuable information on churn.
We also see a level named "", which is empty. These might be users
whose device brand we don't have access to.
Device Model
Device model suffers from high cardinality (it has 1720 levels), so we apply the same operation to it.
model_names_100 <- names(which(table(x$device_model) > 100))
model_names_100
## [1] "" "ANE-LX1" "CLT-L29" "DUB-LX1"
## [5] "ELE-L29" "EML-L29" "FIG-LX1" "M2003J15SC"
## [9] "M2004J19C" "M2006C3MNG" "M2007J20CG" "M2010J19SY"
## [13] "M2101K6G" "M2101K7AG" "M2101K7BNY" "M2102J20SG"
## [17] "M2103K19G" "MAR-LX1A" "MRD-LX1" "POT-LX1"
## [21] "Redmi 8" "Redmi Note 7" "Redmi Note 8" "Redmi Note 8 Pro"
## [25] "Redmi Note 8T" "Redmi Note 9 Pro" "Redmi Note 9S" "SM-A105FN"
## [29] "SM-A125F" "SM-A127F" "SM-A202F" "SM-A217F"
## [33] "SM-A307FN" "SM-A315G" "SM-A325F" "SM-A326B"
## [37] "SM-A405FN" "SM-A415F" "SM-A505FN" "SM-A515F"
## [41] "SM-A520F" "SM-A525F" "SM-A528B" "SM-A600FN"
## [45] "SM-A705FN" "SM-A715F" "SM-A725F" "SM-A750FN"
## [49] "SM-G780F" "SM-G780G" "SM-G781B" "SM-G950F"
## [53] "SM-G955F" "SM-G960F" "SM-G965F" "SM-G970F"
## [57] "SM-G973F" "SM-G975F" "SM-G980F" "SM-G991B"
## [61] "SM-G996B" "SM-J415FN" "SM-J600FN" "SNE-LX1"
## [65] "STK-LX1" "VOG-L29" "YAL-L21"
library(data.table)
xvdt <- data.table(x)
model100 <- xvdt[device_model %in% model_names_100]
model100
## X_id country device_brand device_model session_cnt session_length
## 1: 88464659 IT samsung SM-A125F 7 13162
## 2: 88466479 BR Xiaomi Redmi 8 1 1809
## 3: 88468912 RO HUAWEI CLT-L29 1 2141
## 4: 88471351 LT HUAWEI SNE-LX1 2 1298
## 5: 88476408 FR Xiaomi Redmi Note 8 1 503
## ---
## 17122: 95086724 PL HUAWEI POT-LX1 34 45765
## 17123: 95092582 BG samsung SM-A125F 10 8133
## 17124: 95099191 IT Xiaomi Redmi Note 8T 4 10668
## 17125: 95117734 GR samsung SM-A217F 3 1548
## 17126: 95122431 GR samsung SM-G975F 3 390
## lang max_lvl_no gameplay_duration bonus_cnt hint1_cnt hint2_cnt
## 1: IT 13 1088 6 0 2
## 2: BR 9 1898 9 0 1
## 3: RO 28 1685 8 0 0
## 4: LT 14 1221 2 0 0
## 5: FR 9 393 3 0 0
## ---
## 17122: PL 2 25228 133 0 25
## 17123: BU 23 10833 1 0 1
## 17124: IT 26 8390 52 0 4
## 17125: GR 20 1248 3 0 1
## 17126: GR 8 310 0 0 0
## hint3_cnt repeat_cnt gold_cnt claim_gold_cnt banner_cnt is_cnt rv_cnt
## 1: 0 7 418 0 17 1 0
## 2: 0 10 25 0 38 0 1
## 3: 0 21 990 0 0 16 1
## 4: 0 7 1360 0 49 2 1
## 5: 0 4 620 0 16 0 0
## ---
## 17122: 33 242 515 0 302 75 45
## 17123: 0 39 1870 0 136 8 2
## 17124: 6 48 1585 0 346 38 20
## 17125: 0 3 115 0 48 5 2
## 17126: 0 1 605 0 8 0 0
## churn multi_lang max_ses_length median_ses_length avg_ses_length
## 1: 1 FALSE 12451 100 1880
## 2: 1 FALSE 1809 1809 1809
## 3: 1 FALSE 2141 2141 2141
## 4: 1 FALSE 1248 649 649
## 5: 1 FALSE 503 503 503
## ---
## 17122: 1 FALSE 18593 400 1346
## 17123: 0 FALSE 3407 231 813
## 17124: 1 FALSE 3846 2992 2667
## 17125: 0 FALSE 621 582 516
## 17126: 0 FALSE 388 2 130
library(ggplot2)
library(forcats) #we will use fct_infreq to plot the models in descending order by their counts
ggplot(model100, aes(fct_infreq(device_model))) + geom_bar(aes(fill = churn), position = "dodge")
ggplot(model100, aes(fct_infreq(device_model))) + geom_bar(aes(fill = churn), position = "fill")
Even when we exclude the models that appear in fewer than 100
observations, the level count is still almost 70, which is too large,
and we overlook nearly 13000 observations, which is also too many. This
means we can't handle the high-cardinality problem of device_model by
simply merging the rare levels into an "Other" category.
We also don’t see a relationship between brands of the models and the
churn ratio from the graphs.For example, some of the Samsung models have
close to average churn ratios, while some of them have way smaller.
The device brand feature will still cover some of the information that
device model gives us, but we can't hope to recover almost all of that
information from device brand, since the two won't be as correlated as
language and country are.
With this information, we can either (1) remove the device_model column
completely, or (2) do some feature engineering: take only the models
whose churn ratios are far above or below the average and use each as a
binary (yes/no) categorical feature. For example, we could take a model
x with a churn ratio of 0.40 (well below average) and include it as a
binary categorical feature x_yes (yes = the device is model x,
no = it is not).
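Option (2) could be sketched like this; the toy data, the 0.2 deviation cutoff, and the is_* column names are illustrative assumptions rather than values taken from the real data:

```r
# Flag models whose churn ratio deviates strongly from the overall ratio,
# then add a binary indicator column for each flagged model (toy data)
d <- data.frame(device_model = c("A", "A", "A", "B", "B", "C"),
                churn        = c(0, 0, 0, 1, 1, 1))
overall  <- mean(d$churn)
by_model <- tapply(d$churn, d$device_model, mean)

# Arbitrary cutoff: keep models whose ratio is more than 0.2 away from the mean
flagged <- names(by_model)[abs(by_model - overall) > 0.2]
for (m in flagged) {
  d[[paste0("is_", m)]] <- factor(as.integer(d$device_model == m))
}
```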
Multi Language
We see that this feature heavily suffers from class imbalance, since,
as expected, most users do not use multiple languages. The users who do
not use multiple languages do not give us much information, since their
churn ratio is very close to the average, as expected.
table(x$multi_lang)
##
## FALSE TRUE
## 29359 641
ggplot(x, aes(multi_lang)) + geom_bar(aes(fill = churn), position = "fill")
We can see from the plot that users with multi_lang = TRUE (641 observations) have a much higher-than-average churn ratio (~87%).
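The ratio read off the plot can also be computed directly. A minimal per-level churn-ratio sketch on toy data (the numbers below are made up and do not reproduce the real 641-user split):

```r
# Churn ratio per level of a categorical feature (toy stand-in for x)
d <- data.frame(multi_lang = factor(c(rep(FALSE, 8), rep(TRUE, 2))),
                churn      = factor(c(1, 0, 1, 0, 1, 1, 0, 1, 1, 1)))
churn_ratio <- tapply(d$churn == "1", d$multi_lang, mean)
churn_ratio   # FALSE: 0.625, TRUE: 1 in this toy example
```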
x_numeric <- x[, sapply(x, is.numeric)]
library(sjmisc)
descr(x_numeric, show = c("mean", "sd", "md", "trimmed", "range", "iqr", "skew"))
##
## ## Basic descriptive statistics
##
## var mean sd md trimmed
## X_id 91090972.38 2069047.39 90726232 90940005.12
## session_cnt 8.58 43.19 3 3.55
## session_length 9082.73 65571.48 1926 2764.35
## max_lvl_no 21.71 55.82 13 16.12
## gameplay_duration 7409.82 214476.01 1453 2091.74
## bonus_cnt 40.14 425.68 6 9.63
## hint1_cnt 0.46 20.46 0 0.00
## hint2_cnt 2.22 8.84 0 0.95
## hint3_cnt 11.43 193.51 0 0.48
## repeat_cnt 71.87 543.95 12 19.02
## gold_cnt 843.29 2447.10 662 724.81
## claim_gold_cnt 0.00 0.00 0 0.00
## banner_cnt 165.05 742.50 46 68.74
## is_cnt 22.82 104.07 5 9.02
## rv_cnt 10.39 131.55 2 2.44
## max_ses_length 3262.95 37949.28 1191 1481.93
## median_ses_length 865.83 1279.61 535 648.11
## avg_ses_length 1097.15 4729.89 656 784.54
## range iqr skew
## 6661078 (88463874-95124952) 3713191.00 0.46
## 2294 (1-2295) 4.00 22.70
## 5099724 (12-5099736) 3616.25 34.57
## 3615 (1-3616) 18.00 34.26
## 36562745 (4-36562749) 2802.25 165.19
## 35727 (0-35727) 15.00 43.89
## 2857 (0-2857) 0.00 105.42
## 563 (0-563) 2.00 26.27
## 22382 (0-22382) 1.00 71.63
## 30460 (0-30460) 29.00 30.05
## 217085 (0-217085) 570.00 67.86
## 0 (0-0) 0.00 NaN
## 30602 (0-30602) 103.00 18.83
## 8774 (0-8774) 18.00 31.51
## 13504 (0-13504) 4.00 62.40
## 5095045 (12-5095057) 1686.25 93.45
## 86854 (0-86854) 730.00 16.54
## 690590 (12-690602) 815.00 110.69
numeric_names <- names(Filter(is.numeric, x))
for (i in numeric_names) {
  print(ggplot(x, aes(x[,i])) + geom_histogram(bins = 1000) + labs(x = i))
}
Because of the extreme outliers in most of the numeric features, it is
impossible to gain any insight from these distributions. This is why we
will first remove the outliers.
Other than this, we see that claim_gold_cnt has only one value, "0".
This means we can remove this column as well.
We experiment with outliers a bit:
x <- x[,!names(x) %in% c("claim_gold_cnt")]
for (i in numeric_names) {
  print(i)
  # Leave column i untouched, then filter 1.5*IQR outliers
  # cumulatively on every other numeric column
  x_no_outliers <- x[,!names(x) %in% c(i)]
  numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))
  for (j in numeric_no_outliers) {
    quartiles <- quantile(x[,j], probs=c(.25, .75), na.rm = FALSE)
    IQR <- IQR(x[,j])
    Lower <- quartiles[1] - 1.5*IQR
    Upper <- quartiles[2] + 1.5*IQR
    x_no_outliers <- subset(x_no_outliers, x_no_outliers[,j] > Lower & x_no_outliers[,j] < Upper)
  }
  print(dim(x_no_outliers))
}
## [1] "X_id"
## [1] 0 22
## [1] "session_cnt"
## [1] 0 22
## [1] "session_length"
## [1] 0 22
## [1] "max_lvl_no"
## [1] 0 22
## [1] "gameplay_duration"
## [1] 0 22
## [1] "bonus_cnt"
## [1] 0 22
## [1] "hint1_cnt"
## [1] 17775 22
## [1] "hint2_cnt"
## [1] 0 22
## [1] "hint3_cnt"
## [1] 0 22
## [1] "repeat_cnt"
## [1] 0 22
## [1] "gold_cnt"
## [1] 0 22
## [1] "claim_gold_cnt"
## [1] 0 23
## [1] "banner_cnt"
## [1] 0 22
## [1] "is_cnt"
## [1] 0 22
## [1] "rv_cnt"
## [1] 0 22
## [1] "max_ses_length"
## [1] 0 22
## [1] "median_ses_length"
## [1] 0 22
## [1] "avg_ses_length"
## [1] 0 22
We know from the earlier summary-statistics table that the IQR of hint1_cnt is 0. The code above removes the outliers of all columns except one, and repeats this for every column in turn.
We find that, as expected (since its IQR is 0), whenever hint1_cnt's outliers are also removed, every single row is filtered out. When we take out all outliers except hint1_cnt's, ~18k rows are left.
x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_no_outliers <- x
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))
for(i in numeric_no_outliers){
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])
  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR
  x_no_outliers <- subset(x, x[,i] > Lower & x[,i] < Upper)
  print(i)
  print(dim(x_no_outliers))
}
## [1] "X_id"
## [1] 30000 23
## [1] "session_cnt"
## [1] 27032 23
## [1] "session_length"
## [1] 26713 23
## [1] "max_lvl_no"
## [1] 27906 23
## [1] "gameplay_duration"
## [1] 26824 23
## [1] "bonus_cnt"
## [1] 26738 23
## [1] "hint1_cnt"
## [1] 0 23
## [1] "hint2_cnt"
## [1] 26036 23
## [1] "hint3_cnt"
## [1] 25311 23
## [1] "repeat_cnt"
## [1] 26591 23
## [1] "gold_cnt"
## [1] 28510 23
## [1] "banner_cnt"
## [1] 27011 23
## [1] "is_cnt"
## [1] 27077 23
## [1] "rv_cnt"
## [1] 26569 23
## [1] "max_ses_length"
## [1] 27448 23
## [1] "median_ses_length"
## [1] 27579 23
## [1] "avg_ses_length"
## [1] 27619 23
Now we remove the outliers of each column one at a time (non-cumulatively) and see no problems with any column other than hint1_cnt.
x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_hint1removed <- x[,!names(x) %in% c("hint1_cnt")]
x_no_outliers <- x_hint1removed
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))
for(i in numeric_no_outliers){
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])
  Lower <- quartiles[1] - 10*IQR
  Upper <- quartiles[2] + 10*IQR
  x_no_outliers <- subset(x_no_outliers, x_no_outliers[,i] > Lower & x_no_outliers[,i] < Upper)
  print(i)
  print(dim(x_no_outliers))
  print(ggplot(x_no_outliers, aes(x_no_outliers[,i])) + geom_histogram(bins = 200, fill = "purple") + labs(x = i))
}
## [1] "X_id"
## [1] 30000 22
## [1] "session_cnt"
## [1] 29292 22
## [1] "session_length"
## [1] 29019 22
## [1] "max_lvl_no"
## [1] 28980 22
## [1] "gameplay_duration"
## [1] 28918 22
## [1] "bonus_cnt"
## [1] 28792 22
## [1] "hint2_cnt"
## [1] 28678 22
## [1] "hint3_cnt"
## [1] 27127 22
## [1] "repeat_cnt"
## [1] 27096 22
## [1] "gold_cnt"
## [1] 27092 22
## [1] "banner_cnt"
## [1] 27088 22
## [1] "is_cnt"
## [1] 27083 22
## [1] "rv_cnt"
## [1] 27003 22
## [1] "max_ses_length"
## [1] 26886 22
## [1] "median_ses_length"
## [1] 26832 22
## [1] "avg_ses_length"
## [1] 26832 22
This is what we get when we take out all outliers cumulatively (except hint1_cnt's).
When we do all of them with the 1.5 IQR rule, only ~18k rows are left. I did it with 10 IQR instead, to lose less data (~3.2k rows, significantly fewer than 12k) while still removing the most extreme outliers, and plotted again.
This can be useful for future models and feature engineering.
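An alternative worth noting for those future models: instead of dropping rows, extreme values can be capped at the IQR fences (winsorizing), which keeps all 30k observations. A minimal sketch; the helper cap_iqr is an illustration, not part of the original analysis:

```r
# Cap a numeric vector at Q1 - k*IQR and Q3 + k*IQR instead of dropping rows
cap_iqr <- function(v, k = 1.5) {
  q <- quantile(v, probs = c(0.25, 0.75), na.rm = TRUE)
  spread <- IQR(v, na.rm = TRUE)
  unname(pmin(pmax(v, q[1] - k * spread), q[2] + k * spread))
}

v <- c(1, 2, 3, 4, 100)   # 100 is an extreme outlier
cap_iqr(v)                # the 100 is pulled down to the upper fence, 7
```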
Now we draw the same plots without the outliers (removed one column at
a time, for all columns except hint1_cnt).
x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_no_outliers <- x[,!names(x) %in% c("hint1_cnt")]
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))
for(i in numeric_no_outliers){
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])
  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR
  x_no_outliers <- subset(x, x[,i] > Lower & x[,i] < Upper)
  print(i)
  print(dim(x_no_outliers))
  print(ggplot(x_no_outliers, aes(x_no_outliers[,i])) + geom_histogram(bins = 200, fill = "purple") + labs(x = i))
}
## [1] "X_id"
## [1] 30000 23
## [1] "session_cnt"
## [1] 27032 23
## [1] "session_length"
## [1] 26713 23
## [1] "max_lvl_no"
## [1] 27906 23
## [1] "gameplay_duration"
## [1] 26824 23
## [1] "bonus_cnt"
## [1] 26738 23
## [1] "hint2_cnt"
## [1] 26036 23
## [1] "hint3_cnt"
## [1] 25311 23
## [1] "repeat_cnt"
## [1] 26591 23
## [1] "gold_cnt"
## [1] 28510 23
## [1] "banner_cnt"
## [1] 27011 23
## [1] "is_cnt"
## [1] 27077 23
## [1] "rv_cnt"
## [1] 26569 23
## [1] "max_ses_length"
## [1] 27448 23
## [1] "median_ses_length"
## [1] 27579 23
## [1] "avg_ses_length"
## [1] 27619 23
Now we do a qualitative correlation analysis between churn and numerical features:
x <- x[,!names(x) %in% c("claim_gold_cnt")]
x_no_outliers <- x[,!names(x) %in% c("hint1_cnt")]
numeric_no_outliers <- names(Filter(is.numeric, x_no_outliers))
for(i in numeric_no_outliers){
  quartiles <- quantile(x[,i], probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(x[,i])
  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR
  x_no_outliers <- subset(x, x[,i] > Lower & x[,i] < Upper)
  print(i)
  print(dim(x_no_outliers))
  print(ggplot(x_no_outliers, aes(x_no_outliers[,i], color = churn)) + geom_histogram(bins = 1000) + labs(x = i) + facet_wrap(~churn))
}
## [1] "X_id"
## [1] 30000 23
## [1] "session_cnt"
## [1] 27032 23
## [1] "session_length"
## [1] 26713 23
## [1] "max_lvl_no"
## [1] 27906 23
## [1] "gameplay_duration"
## [1] 26824 23
## [1] "bonus_cnt"
## [1] 26738 23
## [1] "hint2_cnt"
## [1] 26036 23
## [1] "hint3_cnt"
## [1] 25311 23
## [1] "repeat_cnt"
## [1] 26591 23
## [1] "gold_cnt"
## [1] 28510 23
## [1] "banner_cnt"
## [1] 27011 23
## [1] "is_cnt"
## [1] 27077 23
## [1] "rv_cnt"
## [1] 26569 23
## [1] "max_ses_length"
## [1] 27448 23
## [1] "median_ses_length"
## [1] 27579 23
## [1] "avg_ses_length"
## [1] 27619 23
We see that lower values of X_id have a much higher churn ratio. Earlier versions of the game being worse may explain this.
Users who have only one session have a lower churn ratio than average. This means we can turn this feature into a binary categorical variable with levels "one session" and "more than one session".
Users who did not gain many levels have higher churn ratios (as one would expect, because they did not play the game much), and as the level increases the churn ratio gets much smaller. After the 10th level there is a strong downhill trend for churners, meaning their chance of churning gets smaller and smaller. After the 25th level, the churn ratio drops below 50%.
One would expect gameplay duration to give valuable information the way level count does, but churning behaviour does not seem to be as strongly correlated with gameplay duration as it is with max level count.
Users who use hint 3 have a much higher churn ratio than those who don't use it.
There is a spike in churn ratio where users had ~500 gold or ~1000 gold. This could also be turned into a binary categorical feature (gold near 500 or 1000: yes/no).
max_ses_length seems to have a spike in churn ratio between ~250 and ~1500.
As average session length increases, the churn ratio decreases, with its maximum between ~250 and ~750.
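Several of these observations suggest simple derived binary features. A sketch on toy data; the thresholds (a single session, level 25, the 250-750 average-session-length band) are approximate values read off the plots, not tuned:

```r
# Binary features suggested by the histograms (toy data)
d <- data.frame(session_cnt    = c(1, 3, 1, 8),
                max_lvl_no     = c(30, 10, 27, 5),
                avg_ses_length = c(300, 900, 500, 100))

d$single_session <- factor(as.integer(d$session_cnt == 1))
d$past_lvl_25    <- factor(as.integer(d$max_lvl_no > 25))
d$avg_in_250_750 <- factor(as.integer(d$avg_ses_length >= 250 &
                                      d$avg_ses_length <= 750))
```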
I will use Chi-square test and information gain for categorical variables.
categorical_names <- names(Filter(is.factor, x))
for (i in categorical_names) {
  print(i)
  print(chisq.test(x$churn, x[, i], correct=FALSE))
}
## [1] "country"
## Warning in chisq.test(x$churn, x[, i], correct = FALSE): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: x$churn and x[, i]
## X-squared = 937.8, df = 70, p-value < 2.2e-16
##
## [1] "device_brand"
## Warning in chisq.test(x$churn, x[, i], correct = FALSE): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: x$churn and x[, i]
## X-squared = 537.56, df = 112, p-value < 2.2e-16
##
## [1] "device_model"
## Warning in chisq.test(x$churn, x[, i], correct = FALSE): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: x$churn and x[, i]
## X-squared = 3854.2, df = 1719, p-value < 2.2e-16
##
## [1] "lang"
##
## Pearson's Chi-squared test
##
## data: x$churn and x[, i]
## X-squared = 789.93, df = 27, p-value < 2.2e-16
##
## [1] "churn"
##
## Pearson's Chi-squared test
##
## data: x$churn and x[, i]
## X-squared = 30000, df = 1, p-value < 2.2e-16
##
## [1] "multi_lang"
##
## Pearson's Chi-squared test
##
## data: x$churn and x[, i]
## X-squared = 166.16, df = 1, p-value < 2.2e-16
All of our categorical variables have very small p-values, meaning their association with the response column is statistically significant. (The churn-vs-churn test is trivially significant and can be ignored; the warnings for country, device_brand, and device_model indicate sparse cells, so those approximations should be read with caution.)
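Since the chi-squared statistic grows with both sample size and the number of levels, the raw X-squared values are not directly comparable across features. Cramér's V normalizes the statistic to [0, 1]; the helper below is an addition for illustration, not part of the original analysis:

```r
# Cramér's V: chi-squared statistic normalized by n and table dimensions
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  k    <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (sum(tab) * (k - 1))))
}

# Toy example: perfectly associated factors give V = 1
a <- factor(c("x", "x", "y", "y"))
b <- factor(c("p", "p", "q", "q"))
cramers_v(a, b)   # 1
```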
library(FSelectorRcpp)
x_categoricals <- x[, sapply(x, is.factor)]
information_gain(formula = churn~., data = x_categoricals)
## attributes importance
## 1 country 0.015812466
## 2 device_brand 0.009781128
## 3 device_model 0.072322337
## 4 lang 0.013286518
## 5 multi_lang 0.003236746
We gain a lot of information from device_model, an acceptable amount from country and lang, and very little from device_brand and multi_lang.
To see the correlations of the numeric variables with churn, we treat churn as an integer and compute the correlation matrix of our numerical features using the corrplot library.
x_churnasint <- read.csv("ml_project_2022_churn_train.csv")
churn_int <- x_churnasint$churn
x_numeric <- x[, sapply(x, is.numeric)]
x_numeric_churnasint <- cbind(x_numeric, churn_int)
library(corrplot)
## corrplot 0.92 loaded
M <- cor(x_numeric_churnasint)
corrplot(M, method="color")
corrplot(M, method="number")
We see that the variables most correlated with churn are X_id, max_lvl_no, and max_ses_length.
session_cnt is more than 80% correlated with repeat_cnt, banner_cnt, and is_cnt.
bonus_cnt is also more than 80% correlated with repeat_cnt.
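The highly correlated pairs can also be listed programmatically instead of read off the plot. A sketch that scans the upper triangle of a correlation matrix for pairs above a threshold; the helper name and toy data are illustrative:

```r
# List variable pairs whose absolute correlation exceeds a threshold
high_cor_pairs <- function(M, threshold = 0.8) {
  M[lower.tri(M, diag = TRUE)] <- NA   # keep each pair only once
  idx <- which(abs(M) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(M)[idx[, 1]],
             var2 = colnames(M)[idx[, 2]],
             cor  = M[idx])
}

# Toy data: a and b are nearly collinear, z is independent
set.seed(1)
a <- rnorm(100); b <- a + rnorm(100, sd = 0.1); z <- rnorm(100)
high_cor_pairs(cor(cbind(a, b, z)))   # reports only the (a, b) pair
```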
library(stats)
x_numeric <- x[, sapply(x, is.numeric)]
x_numeric_churn <- cbind(x_numeric, x$churn)
numeric_names <- names(Filter(is.integer, x))
for (i in numeric_names) {
  print(i)
  print(t.test(x[,i]~x$churn, data = x_numeric_churn))
}
## [1] "X_id"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = 31.96, df = 24074, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 723446.8 817979.3
## sample estimates:
## mean in group 0 mean in group 1
## 91574158 90803445
##
## [1] "session_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -8.0774, df = 25327, p-value = 6.908e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.045189 -3.074799
## sample estimates:
## mean in group 0 mean in group 1
## 6.036455 10.096448
##
## [1] "session_length"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -4.7324, df = 29640, p-value = 2.229e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -4745.812 -1965.974
## sample estimates:
## mean in group 0 mean in group 1
## 6978.81 10334.70
##
## [1] "max_lvl_no"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = 17.353, df = 14041, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 12.03940 15.10552
## sample estimates:
## mean in group 0 mean in group 1
## 30.22043 16.64797
##
## [1] "gameplay_duration"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -1.5998, df = 20047, p-value = 0.1097
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -7105.7318 719.1563
## sample estimates:
## mean in group 0 mean in group 1
## 5407.839 8601.127
##
## [1] "bonus_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -1.5341, df = 18515, p-value = 0.125
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -19.123202 2.331502
## sample estimates:
## mean in group 0 mean in group 1
## 34.87840 43.27425
##
## [1] "hint1_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -2.5713, df = 22454, p-value = 0.01014
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.8806095 -0.1187855
## sample estimates:
## mean in group 0 mean in group 1
## 0.1488563 0.6485538
##
## [1] "hint2_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -8.5327, df = 29987, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.9752406 -0.6108895
## sample estimates:
## mean in group 0 mean in group 1
## 1.726501 2.519566
##
## [1] "hint3_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -1.6615, df = 14271, p-value = 0.09664
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -9.8285751 0.8105005
## sample estimates:
## mean in group 0 mean in group 1
## 8.607934 13.116972
##
## [1] "repeat_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -3.752, df = 25447, p-value = 0.0001758
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -36.13017 -11.33456
## sample estimates:
## mean in group 0 mean in group 1
## 56.99115 80.72352
##
## [1] "gold_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = 4.8555, df = 17353, p-value = 1.211e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 92.94369 218.78462
## sample estimates:
## mean in group 0 mean in group 1
## 941.0042 785.1400
##
## [1] "banner_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -2.3004, df = 26228, p-value = 0.02144
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -36.395104 -2.907029
## sample estimates:
## mean in group 0 mean in group 1
## 152.7311 172.3821
##
## [1] "is_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -2.7922, df = 28191, p-value = 0.005239
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.522520 -0.966992
## sample estimates:
## mean in group 0 mean in group 1
## 20.78422 24.02898
##
## [1] "rv_cnt"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -1.1253, df = 15782, p-value = 0.2605
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.495740 1.487027
## sample estimates:
## mean in group 0 mean in group 1
## 9.129736 11.134092
##
## [1] "max_ses_length"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = -2.0159, df = 23626, p-value = 0.04382
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1447.62949 -20.32813
## sample estimates:
## mean in group 0 mean in group 1
## 2802.796 3536.775
##
## [1] "median_ses_length"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = 34.374, df = 18995, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 517.5379 580.1295
## sample estimates:
## mean in group 0 mean in group 1
## 1209.9140 661.0802
##
## [1] "avg_ses_length"
##
## Welch Two Sample t-test
##
## data: x[, i] by x$churn
## t = 9.4766, df = 23307, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 339.8406 517.0783
## sample estimates:
## mean in group 0 mean in group 1
## 1365.7693 937.3099
for (i in numeric_names) {
print(i)
print(oneway.test(x[,i]~churn, data = x, var.equal = FALSE))
}
## [1] "X_id"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 1021.5, num df = 1, denom df = 24074, p-value < 2.2e-16
##
## [1] "session_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 65.244, num df = 1, denom df = 25327, p-value = 6.908e-16
##
## [1] "session_length"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 22.396, num df = 1, denom df = 29640, p-value = 2.229e-06
##
## [1] "max_lvl_no"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 301.14, num df = 1, denom df = 14041, p-value < 2.2e-16
##
## [1] "gameplay_duration"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 2.5593, num df = 1, denom df = 20047, p-value = 0.1097
##
## [1] "bonus_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 2.3534, num df = 1, denom df = 18515, p-value = 0.125
##
## [1] "hint1_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 6.6116, num df = 1, denom df = 22454, p-value = 0.01014
##
## [1] "hint2_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 72.806, num df = 1, denom df = 29987, p-value < 2.2e-16
##
## [1] "hint3_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 2.7605, num df = 1, denom df = 14271, p-value = 0.09664
##
## [1] "repeat_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 14.078, num df = 1, denom df = 25447, p-value = 0.0001758
##
## [1] "gold_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 23.576, num df = 1, denom df = 17353, p-value = 1.211e-06
##
## [1] "banner_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 5.2916, num df = 1, denom df = 26228, p-value = 0.02144
##
## [1] "is_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 7.7961, num df = 1, denom df = 28191, p-value = 0.005239
##
## [1] "rv_cnt"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 1.2662, num df = 1, denom df = 15782, p-value = 0.2605
##
## [1] "max_ses_length"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 4.0638, num df = 1, denom df = 23626, p-value = 0.04382
##
## [1] "median_ses_length"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 1181.6, num df = 1, denom df = 18995, p-value < 2.2e-16
##
## [1] "avg_ses_length"
##
## One-way analysis of means (not assuming equal variances)
##
## data: x[, i] and churn
## F = 89.807, num df = 1, denom df = 23307, p-value < 2.2e-16
The p-values of the Welch t-test and the Welch ANOVA are identical: with only two groups the two tests are equivalent (the F statistic is the square of the t statistic). We will treat variables with a p-value below 0.05 as significant, and those above 0.05 as not significant.
Non-significant variables:
rv_cnt
hint3_cnt
gameplay_duration
Significant variables:
max_ses_length
banner_cnt
hint1_cnt
Highly significant variables:
avg_ses_length
median_ses_length
gold_cnt
hint2_cnt
max_lvl_no
session_length
session_cnt
X_id
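The matching p-values above are no coincidence: with two groups, `oneway.test(..., var.equal = FALSE)` is the Welch two-sample t-test. A quick check on simulated data (a sketch, not part of the analysis):

```r
# With two groups, Welch's one-way ANOVA reduces to the Welch t-test:
# F = t^2 and the p-values coincide.
set.seed(1)
grp <- rep(c(0, 1), each = 50)
y   <- rnorm(100) + 0.5 * grp   # shift group 1's mean
g   <- factor(grp)

tt <- t.test(y ~ g)                          # Welch two-sample t-test
aw <- oneway.test(y ~ g, var.equal = FALSE)  # Welch ANOVA

all.equal(tt$p.value, aw$p.value)
all.equal(unname(aw$statistic), unname(tt$statistic)^2)
```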
We will fit logistic regression (1) with all numerical variables (except claim_gold_cnt), (2) with the significant and highly significant variables, and (3) with only the highly significant variables, and compare the resulting accuracies and coefficients.
All numeric variables
x_all_numeric <- x[, sapply(x, is.numeric)]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)
x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]
#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6100 -1.1371 0.6929 0.8858 7.0236
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.054e+01 7.085e-01 28.996 < 2e-16 ***
## X_id -2.149e-07 7.748e-09 -27.735 < 2e-16 ***
## session_cnt 2.082e-02 3.412e-03 6.101 1.06e-09 ***
## session_length -5.641e-06 1.262e-06 -4.469 7.87e-06 ***
## max_lvl_no -2.323e-02 1.084e-03 -21.440 < 2e-16 ***
## gameplay_duration -1.612e-06 1.239e-06 -1.301 0.193349
## bonus_cnt -1.483e-03 2.180e-04 -6.803 1.03e-11 ***
## hint1_cnt -3.357e-03 1.790e-03 -1.875 0.060727 .
## hint2_cnt 5.173e-02 4.792e-03 10.793 < 2e-16 ***
## hint3_cnt 4.975e-04 6.673e-04 0.746 0.455914
## repeat_cnt 7.273e-04 3.592e-04 2.025 0.042910 *
## gold_cnt 1.686e-04 1.218e-05 13.844 < 2e-16 ***
## banner_cnt -2.582e-04 1.236e-04 -2.090 0.036658 *
## is_cnt 5.791e-03 9.917e-04 5.839 5.26e-09 ***
## rv_cnt -2.259e-03 5.585e-04 -4.045 5.23e-05 ***
## max_ses_length 1.189e-05 3.382e-06 3.515 0.000439 ***
## median_ses_length -4.509e-04 3.109e-05 -14.506 < 2e-16 ***
## avg_ses_length 3.663e-05 2.447e-05 1.497 0.134434
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26437 on 19999 degrees of freedom
## Residual deviance: 23620 on 19982 degrees of freedom
## AIC: 23656
##
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
## 0 1
## 2823 7177
act_churn <- x_all_numeric_churn_v$churn
table(lg.pred,act_churn)
## act_churn
## lg.pred 0 1
## 0 1816 1007
## 1 1901 5276
mean(lg.pred==act_churn)
## [1] 0.7092
We have 70.92% accuracy.
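The predict / threshold / accuracy steps repeat for every model below, so they could be wrapped in a small helper (a sketch; the 0.55 threshold matches the one used throughout):

```r
# Sketch of a helper to avoid repeating the predict/threshold/accuracy steps.
# `model` is a fitted glm; `newdata` must contain the churn column.
churn_accuracy <- function(model, newdata, threshold = 0.55) {
  probs <- predict(model, newdata, type = "response")
  pred  <- ifelse(probs > threshold, "1", "0")
  mean(pred == newdata$churn)
}
```

For example, `churn_accuracy(lg, x_all_numeric_churn_v)` would reproduce the accuracy computed above.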
Significant numeric variables
x_all_numeric <- x[, sapply(x, is.numeric)]
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c("rv_cnt", "gameplay_duration", "hint3_cnt")]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)
x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]
#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6704 -1.1376 0.6932 0.8854 7.2544
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.051e+01 7.081e-01 28.963 < 2e-16 ***
## X_id -2.144e-07 7.744e-09 -27.692 < 2e-16 ***
## session_cnt 1.891e-02 3.291e-03 5.746 9.15e-09 ***
## session_length -5.471e-06 1.146e-06 -4.772 1.82e-06 ***
## max_lvl_no -2.289e-02 1.075e-03 -21.306 < 2e-16 ***
## bonus_cnt -1.256e-03 2.031e-04 -6.187 6.14e-10 ***
## hint1_cnt -3.949e-03 1.742e-03 -2.266 0.023427 *
## hint2_cnt 5.156e-02 4.820e-03 10.696 < 2e-16 ***
## repeat_cnt 4.932e-04 3.500e-04 1.409 0.158741
## gold_cnt 1.651e-04 1.256e-05 13.139 < 2e-16 ***
## banner_cnt -3.768e-04 1.153e-04 -3.269 0.001078 **
## is_cnt 5.956e-03 9.883e-04 6.027 1.67e-09 ***
## max_ses_length 1.169e-05 3.252e-06 3.596 0.000323 ***
## median_ses_length -4.553e-04 3.111e-05 -14.637 < 2e-16 ***
## avg_ses_length 3.684e-05 2.460e-05 1.497 0.134270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26437 on 19999 degrees of freedom
## Residual deviance: 23634 on 19985 degrees of freedom
## AIC: 23664
##
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
## 0 1
## 2820 7180
act_churn <- x_all_numeric_churn_v$churn
table(lg.pred,act_churn)
## act_churn
## lg.pred 0 1
## 0 1811 1009
## 1 1906 5274
mean(lg.pred==act_churn)
## [1] 0.7085
We have 70.85% accuracy.
Highly significant numeric variables
x_all_numeric <- x[, sapply(x, is.numeric)]
# note: the intended column is "max_ses_length"; the misspelled
# "max_session_length" matches nothing, so max_ses_length remains in the model
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c("rv_cnt", "gameplay_duration", "hint3_cnt", "max_session_length", "hint1_cnt", "banner_cnt")]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)
x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]
#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6223 -1.1390 0.6940 0.8861 7.0999
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.052e+01 7.078e-01 28.985 < 2e-16 ***
## X_id -2.144e-07 7.741e-09 -27.694 < 2e-16 ***
## session_cnt 1.477e-02 3.091e-03 4.777 1.78e-06 ***
## session_length -5.238e-06 1.160e-06 -4.514 6.36e-06 ***
## max_lvl_no -2.290e-02 1.082e-03 -21.161 < 2e-16 ***
## bonus_cnt -9.972e-04 2.059e-04 -4.844 1.27e-06 ***
## hint2_cnt 5.122e-02 4.883e-03 10.489 < 2e-16 ***
## repeat_cnt 1.861e-04 3.408e-04 0.546 0.584993
## gold_cnt 1.649e-04 1.308e-05 12.608 < 2e-16 ***
## is_cnt 4.797e-03 9.210e-04 5.208 1.90e-07 ***
## max_ses_length 1.117e-05 3.291e-06 3.393 0.000691 ***
## median_ses_length -4.680e-04 3.113e-05 -15.034 < 2e-16 ***
## avg_ses_length 3.810e-05 2.496e-05 1.526 0.126887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26437 on 19999 degrees of freedom
## Residual deviance: 23646 on 19987 degrees of freedom
## AIC: 23672
##
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
## 0 1
## 2804 7196
act_churn <- x_all_numeric_churn_v$churn
table(lg.pred,act_churn)
## act_churn
## lg.pred 0 1
## 0 1796 1008
## 1 1921 5275
mean(lg.pred==act_churn)
## [1] 0.7071
We have 70.71% accuracy.
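Subsetting with `%in%` silently ignores names that do not exist in the data frame, so a misspelled column name drops nothing. A small guard makes such slips fail loudly (a sketch, not part of the original analysis):

```r
# Drop columns by name, stopping if any requested name does not exist.
drop_cols <- function(df, cols) {
  missing <- setdiff(cols, names(df))
  if (length(missing) > 0) {
    stop("unknown column(s): ", paste(missing, collapse = ", "))
  }
  df[, !names(df) %in% cols, drop = FALSE]
}
```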
Now we will try removing some highly correlated features and see whether our accuracy increases.
We will remove repeat_cnt, banner_cnt, and is_cnt.
x_all_numeric <- x[, sapply(x, is.numeric)]
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c("repeat_cnt", "banner_cnt", "is_cnt")]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)
x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]
#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_all_numeric_churn_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.3031 -1.1436 0.6959 0.8866 6.8192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.029e+01 7.062e-01 28.733 < 2e-16 ***
## X_id -2.125e-07 7.728e-09 -27.495 < 2e-16 ***
## session_cnt 2.929e-02 2.984e-03 9.818 < 2e-16 ***
## session_length -5.279e-06 1.460e-06 -3.615 0.000301 ***
## max_lvl_no -2.085e-02 1.000e-03 -20.849 < 2e-16 ***
## gameplay_duration -8.792e-07 1.084e-06 -0.811 0.417256
## bonus_cnt -7.855e-04 1.171e-04 -6.708 1.98e-11 ***
## hint1_cnt -3.856e-03 2.098e-03 -1.838 0.066061 .
## hint2_cnt 5.348e-02 4.824e-03 11.087 < 2e-16 ***
## hint3_cnt 4.842e-04 8.398e-04 0.577 0.564224
## gold_cnt 1.557e-04 1.110e-05 14.024 < 2e-16 ***
## rv_cnt -1.806e-03 5.505e-04 -3.282 0.001032 **
## max_ses_length 1.045e-05 3.669e-06 2.847 0.004414 **
## median_ses_length -4.391e-04 3.084e-05 -14.237 < 2e-16 ***
## avg_ses_length 4.206e-05 2.457e-05 1.712 0.086842 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26437 on 19999 degrees of freedom
## Residual deviance: 23670 on 19985 degrees of freedom
## AIC: 23700
##
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
## 0 1
## 2780 7220
act_churn <- x_all_numeric_churn_v$churn
table(lg.pred,act_churn)
## act_churn
## lg.pred 0 1
## 0 1778 1002
## 1 1939 5281
mean(lg.pred==act_churn)
## [1] 0.7059
Our accuracy is 70.59%.
Logistic regression on all numeric variables but one (iterating over each variable)
for (i in numeric_names) {
x_all_numeric <- x[, sapply(x, is.numeric)]
x_all_numeric <- x_all_numeric[,!names(x_all_numeric) %in% c(i)]
churn <- x$churn
x_all_numeric_churn <- cbind(x_all_numeric, churn)
x_all_numeric_churn_t = x_all_numeric_churn[1:20000, ]
x_all_numeric_churn_v = x_all_numeric_churn[20001:30000, ]
#model and coefficients
lg <- glm(churn~., data = x_all_numeric_churn_t, family = "binomial")
summary(lg)
#prediction and accuracy
lg.probs <- predict(lg, x_all_numeric_churn_v, type="response")
lg.pred = rep("0", 10000)
lg.pred[lg.probs >.55]="1"
act_churn <- x_all_numeric_churn_v$churn
print(i)
print(mean(lg.pred==act_churn))
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "X_id"
## [1] 0.6921
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "session_cnt"
## [1] 0.7058
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "session_length"
## [1] 0.7088
## [1] "max_lvl_no"
## [1] 0.6863
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "gameplay_duration"
## [1] 0.7095
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "bonus_cnt"
## [1] 0.7072
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "hint1_cnt"
## [1] 0.7092
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "hint2_cnt"
## [1] 0.7044
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "hint3_cnt"
## [1] 0.7092
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "repeat_cnt"
## [1] 0.7088
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "gold_cnt"
## [1] 0.709
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "banner_cnt"
## [1] 0.7082
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "is_cnt"
## [1] 0.7066
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "rv_cnt"
## [1] 0.7084
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "max_ses_length"
## [1] 0.7089
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "median_ses_length"
## [1] 0.7067
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] "avg_ses_length"
## [1] 0.7092
We see that removing “avg_ses_length”, “hint3_cnt”, or “hint1_cnt” leaves our accuracy unchanged.
Removing “gameplay_duration” slightly increases our accuracy (0.7095 vs. 0.7092).
Removing “X_id” or “max_lvl_no” decreases our accuracy substantially.
Conclusion & Insights from Logistic Regression Experiments
Our accuracy did not improve when we removed the features that the ANOVA and t-test deemed “not significant” (large p-value).
avg_ses_length contributed little to our logistic regression model even though it had a small p-value.
gameplay_duration had a slightly negative impact on the accuracy of our model.
X_id and max_lvl_no had a substantial positive impact on our model. (That an opaque ID column is this predictive is worth noting: IDs are often assigned sequentially, so X_id may simply be acting as a proxy for signup time.)
Formulating Skill
max_lvl_no/gameplay_duration -> gives us how far a player progressed per unit of time played.
This might seem reasonable at first, but it ignores that gaining levels probably gets harder as the game progresses. Using this ratio as skill would therefore punish players who advance far in the game, while someone who played very little and passed only one or two easy levels could come across as highly skilled.
We can try squaring the level count to offset the penalty coming from the game getting harder over time:
skill = (max_lvl_no)^2/gameplay_duration
Another thing we did not take into account is the use of in-game features that help players pass levels but have nothing to do with skill. If a player uses the bonus or repeat features, watches rewarded ads (which usually boost you in the game), or relies heavily on hints, gaining levels becomes much easier, so those counts should be penalized when formulating “skill”. We incorporate them into the formula as:
skill = (max_lvl_no)^2/gameplay_duration + repeat_cnt + bonus_cnt + rv_cnt + hint1_cnt + hint2_cnt + hint3_cnt
(Note that, as written, the counts are added and therefore raise the score; subtracting them would implement the stated penalty. We keep the additive form here to match the code below.)
Because gameplay_duration (for example) is on a much larger scale in raw units than the other features, it would dominate the whole formula. To avoid this, we normalize all of these features to a common range before combining them.
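For reference, the range scaling performed below by `BBmisc::normalize(method = "range", range = c(0, 1))` can be written in base R as follows (a sketch; it assumes no constant columns, since a zero range would divide by zero):

```r
# Scale a numeric vector to [0, 1].
range01 <- function(v) {
  (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
}

# Apply range scaling to every numeric column, leaving the rest untouched.
minmax_scale <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], range01)
  df
}
```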
Normalizing the features:
library(BBmisc)
##
## Attaching package: 'BBmisc'
## The following objects are masked from 'package:sjmisc':
##
## %nin%, seq_col, seq_row
## The following object is masked from 'package:base':
##
## isFALSE
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#penalized features -> repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt
library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no/gameplay_duration]
str(xvdt)
## Classes 'data.table' and 'data.frame': 30000 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 112 42.7 162.5 108 210.7 ...
## - attr(*, ".internal.selfref")=<externalptr>
quartiles <- quantile(xvdt$skill, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)
dim(xvdt_skill_no)
## [1] 28857 24
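The Tukey-fence filtering above is repeated for each skill variant, so it could be wrapped in a reusable helper (a sketch):

```r
# Keep rows whose value in `col` lies within k * IQR of the quartiles
# (Tukey fences; k = 1.5 is the conventional outlier threshold).
remove_outliers <- function(df, col, k = 1.5) {
  v <- df[[col]]
  q <- quantile(v, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  df[v > q[1] - fence & v < q[2] + fence, ]
}
```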
library(ggplot2)
ggplot(xvdt_skill_no, aes(skill)) + geom_histogram(bins = 1000) + facet_wrap(xvdt_skill_no$churn)
# as.numeric() on the factor yields 1/2 rather than 0/1 coding;
# the correlation below is unaffected by this linear shift
xvdt_skill_no$churn <- as.numeric(xvdt_skill_no$churn)
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame': 28857 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : num 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 112 42.7 162.5 108 210.7 ...
## - attr(*, ".internal.selfref")=<externalptr>
dim(xvdt_skill_no)
## [1] 28857 24
print(cor(xvdt_skill_no$skill, xvdt_skill_no$churn))
## [1] -0.2346475
When we create the new column skill = max_lvl_no/gameplay_duration, we see the distribution in the plot above and find its correlation with the response column to be about -0.23. That is the strongest correlation among our features, and it suggests that more skilled players are less likely to churn.
library(BBmisc)
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#penalized features -> repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt
library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no^2/gameplay_duration]
str(xvdt)
## Classes 'data.table' and 'data.frame': 30000 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 0.3717 0.0945 1.2133 0.3885 0.6412 ...
## - attr(*, ".internal.selfref")=<externalptr>
quartiles <- quantile(xvdt$skill, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)
dim(xvdt_skill_no)
## [1] 28018 24
library(ggplot2)
ggplot(xvdt_skill_no, aes(skill)) + geom_histogram(bins = 1000) + facet_wrap(xvdt_skill_no$churn)
xvdt_skill_no$churn <- as.numeric(xvdt_skill_no$churn)
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame': 28018 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : num 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 0.3717 0.0945 1.2133 0.3885 0.6412 ...
## - attr(*, ".internal.selfref")=<externalptr>
print(cor(xvdt_skill_no$skill, xvdt_skill_no$churn))
## [1] -0.3210445
dim(xvdt_skill_no)
## [1] 28018 24
When we account for the game getting harder over time by squaring max_lvl_no, as discussed above, the correlation is even stronger at about -0.32, although we had to drop roughly 2,000 rows that are outliers in the skill feature. This is an acceptable sacrifice for a feature that is 32% correlated with our response column, because no other feature comes close to that number.
library(BBmisc)
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#penalized features -> repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt
library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no^2/gameplay_duration+repeat_cnt+bonus_cnt+rv_cnt+hint1_cnt+hint2_cnt+hint3_cnt]
str(xvdt)
## Classes 'data.table' and 'data.frame': 30000 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 0.376 0.097 1.214 0.389 0.643 ...
## - attr(*, ".internal.selfref")=<externalptr>
quartiles <- quantile(xvdt$skill, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)
dim(xvdt_skill_no)
## [1] 27984 24
library(ggplot2)
ggplot(xvdt_skill_no, aes(skill)) + geom_histogram(bins = 1000) + facet_wrap(~ churn)
xvdt_skill_no$churn <- as.numeric(xvdt_skill_no$churn)
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame': 27984 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : num 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 0.376 0.097 1.214 0.389 0.643 ...
## - attr(*, ".internal.selfref")=<externalptr>
print(cor(xvdt_skill_no$skill, xvdt_skill_no$churn))
## [1] -0.3200455
dim(xvdt_skill_no)
## [1] 27984 24
However, when we add the penalizing features that boost a player's level into the equation, the correlation does not improve. This is why we will not use them and will define our new column skill simply as (max_lvl_no)^2/gameplay_duration.
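The comparison behind this decision follows a simple pattern: compute cor() against churn for each candidate definition and keep the stronger one. A minimal self-contained sketch on simulated data (the real comparison uses the xvdt columns above):

```r
# Sketch on simulated data: compare two candidate skill definitions by
# their correlation with a numeric churn indicator, as done above with xvdt.
set.seed(42)
n <- 1000
max_lvl_no <- runif(n)
gameplay_duration <- runif(n, 0.1, 1)
bonus_cnt <- runif(n)                       # stands in for the count terms
churn_num <- rbinom(n, 1, 0.5)              # toy response

skill_plain     <- max_lvl_no^2 / gameplay_duration
skill_penalized <- skill_plain + bonus_cnt  # variant with the count terms added

print(cor(skill_plain, churn_num))
print(cor(skill_penalized, churn_num))
# keep whichever definition has the stronger absolute correlation with churn
```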
Experimenting with logistic regression with the skill feature added.
Add the new feature:
#normalize
library(BBmisc)
x_normalized <- x
x_normalized = normalize(x_normalized, method = "range", range = c(0, 1))
#add new column
library(data.table)
xvdt <- data.table(x_normalized)
xvdt[, skill := max_lvl_no^2/gameplay_duration]
str(xvdt)
## Classes 'data.table' and 'data.frame': 30000 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 0.3717 0.0945 1.2133 0.3885 0.6412 ...
## - attr(*, ".internal.selfref")=<externalptr>
#remove outliers from skill
quartiles <- quantile(xvdt$skill, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(xvdt$skill)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
xvdt_skill_no <- subset(xvdt, xvdt$skill > Lower & xvdt$skill < Upper)
#check the work
str(xvdt_skill_no)
## Classes 'data.table' and 'data.frame': 28018 obs. of 24 variables:
## $ X_id : num 0.000118 0.000391 0.000756 0.001122 0.001394 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : num 0.002616 0 0 0.000436 0 ...
## $ session_length : num 0.002579 0.000352 0.000417 0.000252 0.000128 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : num 0.00332 0.00221 0.00747 0.0036 0.00304 ...
## $ gameplay_duration: num 2.96e-05 5.18e-05 4.60e-05 3.33e-05 1.44e-05 ...
## $ bonus_cnt : num 0.000168 0.000252 0.000224 0.000056 0.000112 ...
## $ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : num 0.00355 0.00178 0 0 0.00178 ...
## $ hint3_cnt : num 0 0 0 0 0 ...
## $ repeat_cnt : num 0.00023 0.000328 0.000689 0.00023 0.000328 ...
## $ gold_cnt : num 0.001926 0.000115 0.00456 0.006265 0.006297 ...
## $ banner_cnt : num 0.000556 0.001242 0 0.001601 0.000752 ...
## $ is_cnt : num 0.000114 0 0.001824 0.000228 0 ...
## $ rv_cnt : num 0.00 7.41e-05 7.41e-05 7.41e-05 7.41e-05 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : num 0.002441 0.000353 0.000418 0.000243 0.000128 ...
## $ median_ses_length: num 0.00115 0.02083 0.02465 0.00747 0.00767 ...
## $ avg_ses_length : num 0.002705 0.002602 0.003083 0.000922 0.000947 ...
## $ skill : num 0.3717 0.0945 1.2133 0.3885 0.6412 ...
## - attr(*, ".internal.selfref")=<externalptr>
dim(xvdt_skill_no)
## [1] 28018 24
Fit a logistic regression with only the numerical variables and calculate its accuracy:
xvdt_skill_no <- as.data.frame(xvdt_skill_no)
xvdt_skill_no_all_numeric <- xvdt_skill_no[, sapply(xvdt_skill_no, is.numeric)]
churn <- xvdt_skill_no$churn
xvdt_skill_no_all_numeric_churn <- cbind(xvdt_skill_no_all_numeric, churn)
xvdt_skill_no_all_numeric_churn_t = xvdt_skill_no_all_numeric_churn[1:20000, ]
xvdt_skill_no_all_numeric_churn_v = xvdt_skill_no_all_numeric_churn[20001:28000, ]
#model and coefficients
lg <- glm(churn~., data = xvdt_skill_no_all_numeric_churn_t, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = xvdt_skill_no_all_numeric_churn_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7880 -1.0229 0.6018 0.8350 5.2618
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.08846 0.04365 47.842 < 2e-16 ***
## X_id -1.53990 0.05412 -28.452 < 2e-16 ***
## session_cnt 15.81115 6.09857 2.593 0.00953 **
## session_length -12.97120 5.10267 -2.542 0.01102 *
## max_lvl_no -46.51730 6.38531 -7.285 3.22e-13 ***
## gameplay_duration -92.49887 48.30935 -1.915 0.05553 .
## bonus_cnt -19.89776 4.82958 -4.120 3.79e-05 ***
## hint1_cnt -6.27388 4.52125 -1.388 0.16525
## hint2_cnt 20.25634 2.90030 6.984 2.86e-12 ***
## hint3_cnt 6.46183 11.15791 0.579 0.56250
## repeat_cnt 1.80539 5.38264 0.335 0.73732
## gold_cnt 48.53020 7.13909 6.798 1.06e-11 ***
## banner_cnt -6.96282 2.91152 -2.391 0.01678 *
## is_cnt 39.34391 7.07149 5.564 2.64e-08 ***
## rv_cnt -7.20974 6.41170 -1.124 0.26082
## max_ses_length 46.28829 26.20220 1.767 0.07730 .
## median_ses_length -33.42723 2.94380 -11.355 < 2e-16 ***
## avg_ses_length 13.35168 20.55285 0.650 0.51593
## skill -1.27390 0.05013 -25.414 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26006 on 19999 degrees of freedom
## Residual deviance: 22305 on 19981 degrees of freedom
## AIC: 22343
##
## Number of Fisher Scoring iterations: 6
#prediction and accuracy
lg.probs <- predict(lg, xvdt_skill_no_all_numeric_churn_v, type="response")
lg.pred = rep("0", 8000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
## 0 1
## 2201 5799
act_churn <- xvdt_skill_no_all_numeric_churn_v$churn
table(lg.pred,act_churn)
## act_churn
## lg.pred 0 1
## 0 1397 804
## 1 1402 4397
mean(lg.pred==act_churn)
## [1] 0.72425
As we can see, our prediction accuracy improved to 72.4% and the p-value of skill is very low. Now we fit a logistic regression using only the skill feature:
xvdt_skill_no <- as.data.frame(xvdt_skill_no)
xvdt_skill_no_all_numeric <- xvdt_skill_no[, sapply(xvdt_skill_no, is.numeric)]
churn <- xvdt_skill_no$churn
xvdt_skill_no_all_numeric_churn <- cbind(xvdt_skill_no_all_numeric, churn)
xvdt_skill_no_all_numeric_churn_t = xvdt_skill_no_all_numeric_churn[1:20000, ]
xvdt_skill_no_all_numeric_churn_v = xvdt_skill_no_all_numeric_churn[20001:28000, ]
#model and coefficients
lg <- glm(churn~skill, data = xvdt_skill_no_all_numeric_churn_t, family = "binomial")
summary(lg)
##
## Call:
## glm(formula = churn ~ skill, family = "binomial", data = xvdt_skill_no_all_numeric_churn_t)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7910 -1.1685 0.6867 0.8428 1.8962
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.37938 0.02397 57.55 <2e-16 ***
## skill -1.53701 0.03503 -43.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26006 on 19999 degrees of freedom
## Residual deviance: 23828 on 19998 degrees of freedom
## AIC: 23832
##
## Number of Fisher Scoring iterations: 4
#prediction and accuracy
lg.probs <- predict(lg, xvdt_skill_no_all_numeric_churn_v, type="response")
lg.pred = rep("0", 8000)
lg.pred[lg.probs >.55]="1"
table(lg.pred)
## lg.pred
## 0 1
## 1851 6149
act_churn <- xvdt_skill_no_all_numeric_churn_v$churn
mean(lg.pred==act_churn)
## [1] 0.674
We see that when we fit our model with only the skill feature, we get 67.4% accuracy, which is pretty good for a single column, so we add it as a feature to our data.
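The probability-to-label step used in both regressions above (lg.probs > .55) can be wrapped in a small base-R helper. A sketch; classify_at and accuracy are hypothetical names, not part of the original analysis:

```r
# Sketch: turn predicted probabilities into "0"/"1" labels at a chosen
# threshold, as done above with lg.probs > .55.
classify_at <- function(probs, threshold = 0.5) {
  ifelse(probs > threshold, "1", "0")
}

accuracy <- function(pred, actual) mean(pred == actual)

# usage example
probs <- c(0.20, 0.60, 0.54, 0.90)
pred <- classify_at(probs, threshold = 0.55)
print(pred)                                # "0" "1" "0" "1"
print(accuracy(pred, c("0", "1", "1", "1")))  # 0.75
```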
skill <- xvdt$skill
x_skill <- cbind(x, skill)
quartiles <- quantile(x_skill$skill, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(x_skill$skill)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
x_skill <- subset(x_skill, x_skill$skill > Lower & x_skill$skill < Upper)
str(x_skill)
## 'data.frame': 28018 obs. of 24 variables:
## $ X_id : int 88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : int 7 1 1 2 1 3 1 3 1 8 ...
## $ session_length : int 13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : int 13 9 28 14 12 20 9 23 9 1 ...
## $ gameplay_duration: int 1088 1898 1685 1221 532 3018 393 747 382 1435 ...
## $ bonus_cnt : int 6 9 8 2 4 23 3 3 0 15 ...
## $ hint1_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : int 2 1 0 0 1 0 0 0 0 0 ...
## $ hint3_cnt : int 0 0 0 0 0 0 0 0 0 2 ...
## $ repeat_cnt : int 7 10 21 7 10 30 4 1 1 38 ...
## $ gold_cnt : int 418 25 990 1360 1367 585 620 2141 665 500 ...
## $ banner_cnt : int 17 38 0 49 23 154 16 33 9 14 ...
## $ is_cnt : int 1 0 16 2 0 7 0 11 0 2 ...
## $ rv_cnt : int 0 1 1 1 1 3 0 1 1 4 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : int 12451 1809 2141 1248 666 1817 503 449 520 605 ...
## $ median_ses_length: int 100 1809 2141 649 666 1609 503 372 520 193 ...
## $ avg_ses_length : int 1880 1809 2141 649 666 1171 503 379 520 236 ...
## $ skill : num 0.3717 0.0945 1.2133 0.3885 0.6412 ...
str(x)
## 'data.frame': 30000 obs. of 23 variables:
## $ X_id : int 88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
## $ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51 ...
## $ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 90 111 52 52 52 111 111 90 90 90 ...
## $ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 1189 958 169 1485 587 622 970 1211 1204 1234 ...
## $ session_cnt : int 7 1 1 2 1 3 1 3 1 8 ...
## $ session_length : int 13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
## $ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18 ...
## $ max_lvl_no : int 13 9 28 14 12 20 9 23 9 1 ...
## $ gameplay_duration: int 1088 1898 1685 1221 532 3018 393 747 382 1435 ...
## $ bonus_cnt : int 6 9 8 2 4 23 3 3 0 15 ...
## $ hint1_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : int 2 1 0 0 1 0 0 0 0 0 ...
## $ hint3_cnt : int 0 0 0 0 0 0 0 0 0 2 ...
## $ repeat_cnt : int 7 10 21 7 10 30 4 1 1 38 ...
## $ gold_cnt : int 418 25 990 1360 1367 585 620 2141 665 500 ...
## $ banner_cnt : int 17 38 0 49 23 154 16 33 9 14 ...
## $ is_cnt : int 1 0 16 2 0 7 0 11 0 2 ...
## $ rv_cnt : int 0 1 1 1 1 3 0 1 1 4 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : int 12451 1809 2141 1248 666 1817 503 449 520 605 ...
## $ median_ses_length: int 100 1809 2141 649 666 1609 503 372 520 193 ...
## $ avg_ses_length : int 1880 1809 2141 649 666 1171 503 379 520 236 ...
Many of our categorical features have high cardinality (many unique factor levels), which makes it hard for many models to learn from them. For example, if we try to build a decision tree with all our features, we get an error saying that factors with more than 32 levels are not supported (I encountered this error but do not include the code here).
We can reduce the cardinality of these features by taking all the levels that occur only a few times in the data and combining them into a single category, “Other”.
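The relevel operation is repeated for several columns below, so it can also be written once as a small helper. A sketch; lump_rare_levels is a hypothetical name, not used elsewhere in this report:

```r
# Sketch of a reusable helper: replace factor levels that occur fewer
# than min_count times with a single "Other" level.
lump_rare_levels <- function(column, min_count) {
  level_counts <- table(column)
  rare <- names(level_counts)[level_counts < min_count]
  factor(ifelse(column %in% rare, "Other", as.character(column)))
}

# usage example on a toy factor: "b" and "c" occur once each, so with
# min_count = 2 they both become "Other"
f <- factor(c("a", "a", "a", "b", "c"))
print(table(lump_rare_levels(f, 2)))
```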
Country
x_skill_reduce <- x_skill
country_column <- x_skill_reduce$country
# get the frequency of each level using the table() function
level_counts <- table(country_column)
# create a vector of levels that occur less than 300 times
levels_to_relevel <- names(level_counts)[level_counts < 300]
# combine the rare levels into "Other" using the ifelse() function
updated_country_column <- factor(ifelse(country_column %in% levels_to_relevel, "Other", as.character(country_column)))
# view the updated levels of the factor column
table(updated_country_column)
## updated_country_column
## BG BR CZ FR GR HU IT LT MX NL Other PL RO
## 2173 1871 1399 2022 2726 1310 2911 1083 541 770 2072 1730 3431
## SE SK TR
## 823 1035 2121
As we can see, updated_country_column is exactly what we want: the 15 frequent countries keep their own categories, while all remaining countries are combined into the “Other” category.
Device_Brand
x_skill_reduce <- x_skill
device_brand_column <- x_skill_reduce$device_brand
# get the frequency of each level using the table() function
level_counts <- table(device_brand_column)
# create a vector of levels that occur less than 300 times
levels_to_relevel <- names(level_counts)[level_counts < 300]
# combine the rare levels into "Other" using the ifelse() function
updated_device_brand_column <- factor(ifelse(device_brand_column %in% levels_to_relevel, "Other", as.character(device_brand_column)))
# view the updated levels of the factor column
table(updated_device_brand_column)
## updated_device_brand_column
## HUAWEI motorola OPPO Other samsung Xiaomi
## 5270 1159 821 2277 12617 5874
Device_Model
x_skill_reduce <- x_skill
device_model_column <- x_skill_reduce$device_model
# get the frequency of each level using the table() function
level_counts <- table(device_model_column)
# create a vector of levels that occur less than 215 times
levels_to_relevel <- names(level_counts)[level_counts < 215]
# combine the rare levels into "Other" using the ifelse() function
updated_device_model_column <- factor(ifelse(device_model_column %in% levels_to_relevel, "Other", as.character(device_model_column)))
# view the updated levels of the factor column
table(updated_device_model_column)
## updated_device_model_column
## ANE-LX1 M2003J15SC M2004J19C
## 280 381 530 367
## MAR-LX1A Other POT-LX1 Redmi Note 7
## 742 17107 386 220
## Redmi Note 8 Redmi Note 8 Pro Redmi Note 8T Redmi Note 9 Pro
## 229 433 291 501
## SM-A105FN SM-A125F SM-A127F SM-A202F
## 323 317 219 501
## SM-A217F SM-A405FN SM-A505FN SM-A515F
## 529 270 432 780
## SM-A528B SM-A705FN SM-A715F SM-G950F
## 219 267 515 253
## SM-G960F SM-G965F SM-G973F SM-G975F
## 266 224 334 217
## SM-G991B SNE-LX1 VOG-L29
## 254 266 365
nlevels(updated_device_model_column)
## [1] 31
Language
x_skill_reduce <- x_skill
lang_column <- x_skill_reduce$lang
# get the frequency of each level using the table() function
level_counts <- table(lang_column)
# create a vector of levels that occur less than 215 times
levels_to_relevel <- names(level_counts)[level_counts < 215]
# combine the rare levels into "Other" using the ifelse() function
updated_lang_column <- factor(ifelse(lang_column %in% levels_to_relevel, "Other", as.character(lang_column)))
# view the updated levels of the factor column
table(updated_lang_column)
## updated_lang_column
## BR BU CS ES FR GR HU IT LT NL Other PL RO
## 1858 2198 1391 1079 2025 2725 1374 2892 1112 815 1105 1754 3481
## RU SE SK TR
## 262 802 1020 2125
nlevels(updated_lang_column)
## [1] 17
x_skill_reduced will be our dataset in which every factor has fewer than 32 levels.
We can now replace our old categorical columns in our data with new columns we created:
x_skill_reduced <- x_skill
x_skill_reduced$country <- updated_country_column
x_skill_reduced$device_brand <- updated_device_brand_column
x_skill_reduced$device_model <- updated_device_model_column
x_skill_reduced$lang <- updated_lang_column
str(x_skill_reduced)
## 'data.frame': 28018 obs. of 24 variables:
## $ X_id : int 88464659 88466479 88468912 88471351 88473161 88474849 88476408 88478719 88480762 88482779 ...
## $ country : Factor w/ 16 levels "BG","BR","CZ",..: 7 2 13 8 13 2 4 13 5 10 ...
## $ device_brand : Factor w/ 6 levels "HUAWEI","motorola",..: 5 6 1 1 1 6 6 5 5 5 ...
## $ device_model : Factor w/ 31 levels "","ANE-LX1","M2003J15SC",..: 14 6 6 30 6 6 9 6 6 20 ...
## $ session_cnt : int 7 1 1 2 1 3 1 3 1 8 ...
## $ session_length : int 13162 1809 2141 1298 666 3514 503 1138 520 1888 ...
## $ lang : Factor w/ 17 levels "BR","BU","CS",..: 8 1 13 9 13 1 5 13 6 10 ...
## $ max_lvl_no : int 13 9 28 14 12 20 9 23 9 1 ...
## $ gameplay_duration: int 1088 1898 1685 1221 532 3018 393 747 382 1435 ...
## $ bonus_cnt : int 6 9 8 2 4 23 3 3 0 15 ...
## $ hint1_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : int 2 1 0 0 1 0 0 0 0 0 ...
## $ hint3_cnt : int 0 0 0 0 0 0 0 0 0 2 ...
## $ repeat_cnt : int 7 10 21 7 10 30 4 1 1 38 ...
## $ gold_cnt : int 418 25 990 1360 1367 585 620 2141 665 500 ...
## $ banner_cnt : int 17 38 0 49 23 154 16 33 9 14 ...
## $ is_cnt : int 1 0 16 2 0 7 0 11 0 2 ...
## $ rv_cnt : int 0 1 1 1 1 3 0 1 1 4 ...
## $ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_ses_length : int 12451 1809 2141 1248 666 1817 503 449 520 605 ...
## $ median_ses_length: int 100 1809 2141 649 666 1609 503 372 520 193 ...
## $ avg_ses_length : int 1880 1809 2141 649 666 1171 503 379 520 236 ...
## $ skill : num 0.3717 0.0945 1.2133 0.3885 0.6412 ...
Regarding evaluation: the model outputs in this report include F1 scores and AUC or AUCPR metrics, but I take accuracy into consideration the most since it is the final objective.
I used logistic regression in the previous sections to gain insight into the numerical features.
Because we don’t have a dimensionality issue (30k rows with ~25 columns), I will not use any dimension reduction methods.
I will also not use any unsupervised learning methods because our aim is to make predictions for a binary classification problem.
Because we have a binary classification problem at hand with limited, non-linear data and categorical features of high cardinality, I will use advanced tree-based methods such as Random Forest and Gradient Boosting, with packages like XGBoost and AutoML, to make predictions on the test set.
I also tried many other models outside this report, but none of them came close to the advanced tree-based methods, so I am not including them here.
I created a decision tree only for visualization and insight purposes, but did not gain much insight from it.
library(tree)
tree_model <- tree(churn~., data = x_skill_reduced)
plot(tree_model)
text(tree_model)
Fit model:
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
x_skill_reduced_train <- x_skill_reduced[1:20000,]
x_skill_reduced_test <- x_skill_reduced[20001:28000,]
rf.fit <- randomForest(churn~.,data=x_skill_reduced_train, importance = TRUE, ntree = 500, mtry = 5)
rf.fit
##
## Call:
## randomForest(formula = churn ~ ., data = x_skill_reduced_train, importance = TRUE, ntree = 500, mtry = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 19.18%
## Confusion matrix:
## 0 1 class.error
## 0 4278 2811 0.39652983
## 1 1025 11886 0.07938967
Predict:
predictions <- predict(rf.fit, x_skill_reduced_test)
library(caret)
## Loading required package: lattice
accuracy_table <- confusionMatrix(predictions, x_skill_reduced_test$churn)
accuracy_table
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1632 409
## 1 1167 4792
##
## Accuracy : 0.803
## 95% CI : (0.7941, 0.8117)
## No Information Rate : 0.6501
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5381
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5831
## Specificity : 0.9214
## Pos Pred Value : 0.7996
## Neg Pred Value : 0.8042
## Prevalence : 0.3499
## Detection Rate : 0.2040
## Detection Prevalence : 0.2551
## Balanced Accuracy : 0.7522
##
## 'Positive' Class : 0
##
importance(rf.fit)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## X_id 108.937222 201.7210424 201.385506 1331.238112
## country 53.735106 -18.5461787 23.898107 386.832844
## device_brand 9.272208 -0.8258751 5.620608 183.759528
## device_model 31.616220 0.6672397 21.125656 618.084458
## session_cnt 47.984465 38.7050698 53.143727 493.672527
## session_length 18.477794 38.7461056 47.258828 350.929038
## lang 54.259147 -14.0314779 26.067932 406.604438
## max_lvl_no 33.866902 34.5110387 51.574454 507.842320
## gameplay_duration 19.033711 44.0672055 54.550536 356.411473
## bonus_cnt 4.774360 37.5327159 40.604553 235.960713
## hint1_cnt 4.673181 -4.8181425 -1.831629 7.687847
## hint2_cnt 11.220287 15.6530185 20.385167 133.705424
## hint3_cnt 65.350676 52.4189577 82.525600 652.255514
## repeat_cnt 5.926775 39.1738887 43.334983 268.899548
## gold_cnt 5.175028 32.0826281 32.786893 351.441040
## banner_cnt 11.515037 39.2345321 45.871689 313.970802
## is_cnt 19.169001 24.3696516 36.399276 221.374463
## rv_cnt -2.421731 34.0238136 31.976261 185.751573
## multi_lang 19.370295 5.6285204 18.220602 20.296478
## max_ses_length 15.164519 39.9997141 43.242556 337.411974
## median_ses_length 30.441889 31.2255973 44.852951 511.919290
## avg_ses_length 27.805952 33.3386544 48.980779 431.217184
## skill 36.678082 44.5458276 65.968913 845.824077
We got 80% accuracy on our test set.
When we look at the variable importances, the ID column is by far the most important, followed by hint3_cnt and skill. Device brand and hint1_cnt have very little importance.
library(randomForest)
x_skill_reduced_train <- x_skill_reduced[1:20000,]
x_skill_reduced_test <- x_skill_reduced[20001:28000,]
rf.fit_2 <- randomForest(churn~.,data=x_skill_reduced_train, importance = TRUE, ntree = 1000)
rf.fit_2
##
## Call:
## randomForest(formula = churn ~ ., data = x_skill_reduced_train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 19%
## Confusion matrix:
## 0 1 class.error
## 0 4269 2820 0.39779941
## 1 981 11930 0.07598172
predictions <- predict(rf.fit_2, x_skill_reduced_test)
accuracy_table <- confusionMatrix(predictions, x_skill_reduced_test$churn)
accuracy_table
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1613 403
## 1 1186 4798
##
## Accuracy : 0.8014
## 95% CI : (0.7925, 0.8101)
## No Information Rate : 0.6501
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5332
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5763
## Specificity : 0.9225
## Pos Pred Value : 0.8001
## Neg Pred Value : 0.8018
## Prevalence : 0.3499
## Detection Rate : 0.2016
## Detection Prevalence : 0.2520
## Balanced Accuracy : 0.7494
##
## 'Positive' Class : 0
##
importance(rf.fit_2)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## X_id 140.656586 226.8747294 234.938766 1247.821531
## country 73.056179 -29.1760630 31.796699 393.775162
## device_brand 11.365478 0.3168308 7.741574 187.721560
## device_model 40.554830 4.4925497 29.427952 593.996796
## session_cnt 65.131916 50.1698792 68.259587 478.913233
## session_length 25.023837 56.9788623 69.704869 366.869690
## lang 72.766187 -21.5350368 34.427194 410.006983
## max_lvl_no 49.240974 47.0332401 73.670509 516.816662
## gameplay_duration 25.553157 54.3552314 73.537160 366.000575
## bonus_cnt 9.291245 46.8951057 53.902685 243.397784
## hint1_cnt 7.390113 -8.8259221 -3.877599 7.780791
## hint2_cnt 13.769559 23.1082666 28.279376 141.455192
## hint3_cnt 88.916977 67.1273185 106.056852 645.282026
## repeat_cnt 9.204935 54.6857487 61.274740 276.610455
## gold_cnt 10.509182 42.4438193 43.328677 356.726603
## banner_cnt 15.804404 50.1201949 63.227288 322.635815
## is_cnt 28.205529 33.6783859 49.539047 236.534822
## rv_cnt -3.611371 44.4985424 41.252114 187.754332
## multi_lang 26.236048 8.8517091 25.593072 20.524749
## max_ses_length 20.541980 57.7586417 62.593033 356.284638
## median_ses_length 44.384494 44.9523424 66.203980 515.638682
## avg_ses_length 36.370464 47.2171876 67.278968 452.295258
## skill 50.490246 62.5563717 93.163265 820.883955
When we fit the random forest with more trees, hint1_cnt shows a negative importance, which suggests that removing this column might improve the accuracy. I got the same result when fitting with 2500 trees.
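Removing such a column can be done by building the model formula programmatically rather than retyping all predictors. A minimal sketch on a toy data frame (refitting the full random forest is expensive, so only the formula construction is shown; the column names mirror a few from x_skill_reduced):

```r
# Sketch: build a churn formula that excludes hint1_cnt, so a model can
# be refit without that column, e.g. randomForest(f, data = ...).
df <- data.frame(churn = factor(c(0, 1)),
                 hint1_cnt = c(0, 0),
                 skill = c(0.3, 0.9),
                 session_cnt = c(1, 5))
predictors <- setdiff(names(df), c("churn", "hint1_cnt"))
f <- reformulate(predictors, response = "churn")
print(f)  # churn ~ skill + session_cnt
```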
I tried to fit gbm with different distributions but failed. Multinomial gave a warning saying that it was currently broken, and it could not make meaningful predictions. Bernoulli failed to fit. I also tried converting the churn column to numeric to fit bernoulli, but that failed as well.
This is why I will fit gradient boosting with h2o.
Install h2o and prepare the data for h2o:
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
##
## hour, month, week, year
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 53 minutes 57 seconds
## H2O cluster timezone: Europe/Istanbul
## H2O data parsing timezone: UTC
## H2O cluster version: 3.38.0.1
## H2O cluster version age: 3 months and 13 days !!!
## H2O cluster name: H2O_started_from_R_erkinyilmaz_yxq635
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.17 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.2.1 (2022-06-23)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (3 months and 13 days)!
## Please download and install the latest version from http://h2o.ai/download/
x_skill_reduced_h2o <- as.h2o(x_skill_reduced)
## Warning in use.package("data.table"): data.table cannot be used without R
## package bit64 version 0.9.7 or higher. Please upgrade to take advangage of
## data.table speedups.
##
|
| | 0%
|
|======================================================================| 100%
x_skill_reduced_h2o_train <- x_skill_reduced_h2o[1:20000,]
x_skill_reduced_h2o_test <- x_skill_reduced_h2o[20001:28000,]
Fit model:
# Get the names of all of the columns in the data
column_names <- names(x_skill_reduced_h2o)
# Remove the response variable from the list of column names
predictor_variables <- column_names[column_names != "churn"]
gbm.fit <- h2o.gbm(x = predictor_variables,
y = "churn",
training_frame = x_skill_reduced_h2o_train,
ntrees = 100,
max_depth = 5,
distribution = "bernoulli")
##
|
| | 0%
|
|============= | 18%
|
|=========================================== | 61%
|
|======================================================================| 100%
gbm.fit
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1672583265719_14612
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 41385 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 11 32 28.26000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.1112661
## RMSE: 0.3335657
## LogLoss: 0.3632016
## Mean Per-Class Error: 0.1856426
## AUC: 0.9115125
## AUCPR: 0.9428195
## Gini: 0.8230249
## R^2: 0.5137296
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 4933 2156 0.304133 =2156/7089
## 1 867 12044 0.067152 =867/12911
## Totals 5800 14200 0.151150 =3023/20000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.495741 0.888495 224
## 2 max f2 0.231996 0.934676 309
## 3 max f0point5 0.710989 0.884733 148
## 4 max accuracy 0.519646 0.849200 217
## 5 max precision 0.982826 1.000000 0
## 6 max recall 0.050175 1.000000 388
## 7 max specificity 0.982826 1.000000 0
## 8 max absolute_mcc 0.519646 0.663772 217
## 9 max min_per_class_accuracy 0.688043 0.829680 157
## 10 max mean_per_class_accuracy 0.670376 0.830743 165
## 11 max tns 0.982826 7089.000000 0
## 12 max fns 0.982826 12880.000000 0
## 13 max fps 0.012078 7089.000000 399
## 14 max tps 0.050175 12911.000000 388
## 15 max tnr 0.982826 1.000000 0
## 16 max fnr 0.982826 0.997599 0
## 17 max fpr 0.012078 1.000000 399
## 18 max tpr 0.050175 1.000000 388
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp_plot(gbm.fit)
The GBM model picks the “skill” feature that we created as the most important predictor.
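Beyond the plot, the same importances can be pulled as a table; a minimal sketch using the fitted model above (the `vi` name is ours):

```r
# Sketch: variable importances as a data frame instead of a plot,
# so the ranking of "skill" and the other predictors can be inspected.
vi <- as.data.frame(h2o.varimp(gbm.fit))
head(vi[, c("variable", "relative_importance", "scaled_importance")])
```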
Predictions and accuracy:
h2o.performance(gbm.fit, x_skill_reduced_h2o_test)
## H2OBinomialMetrics: gbm
##
## MSE: 0.1437507
## RMSE: 0.3791447
## LogLoss: 0.4530854
## Mean Per-Class Error: 0.258661
## AUC: 0.8381856
## AUCPR: 0.8787315
## Gini: 0.6763713
## R^2: 0.3680245
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1520 1279 0.456949 =1279/2799
## 1 314 4887 0.060373 =314/5201
## Totals 1834 6166 0.199125 =1593/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.385004 0.859857 265
## 2 max f2 0.202673 0.922341 323
## 3 max f0point5 0.662970 0.842887 174
## 4 max accuracy 0.474420 0.803875 237
## 5 max precision 0.988740 1.000000 0
## 6 max recall 0.030222 1.000000 395
## 7 max specificity 0.988740 1.000000 0
## 8 max absolute_mcc 0.474420 0.554598 237
## 9 max min_per_class_accuracy 0.708842 0.765273 156
## 10 max mean_per_class_accuracy 0.662970 0.773858 174
## 11 max tns 0.988740 2799.000000 0
## 12 max fns 0.988740 5200.000000 0
## 13 max fps 0.007193 2799.000000 399
## 14 max tps 0.030222 5201.000000 395
## 15 max tnr 0.988740 1.000000 0
## 16 max fnr 0.988740 0.999808 0
## 17 max fpr 0.007193 1.000000 399
## 18 max tpr 0.030222 1.000000 395
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1593/8000) #our accuracy
## [1] 0.800875
We get 80% accuracy (computed from the confusion matrix above).
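Rather than retyping the error count from the printed confusion matrix, the accuracy at the F1-optimal threshold can be extracted programmatically; a sketch (the `perf` and `thr` names are ours):

```r
# Sketch: pull accuracy from the performance object instead of
# hardcoding counts; the threshold is the F1-optimal one that the
# confusion matrix above is reported at.
perf <- h2o.performance(gbm.fit, x_skill_reduced_h2o_test)
thr  <- h2o.find_threshold_by_max_metric(perf, "f1")
h2o.accuracy(perf, thresholds = thr)
```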
Now let’s try the data set whose categorical cardinality was not reduced:
x_skill_h2o <- as.h2o(x_skill)
x_skill_h2o_train <- x_skill_h2o[1:20000,]
x_skill_h2o_test <- x_skill_h2o[20001:28000,]
Fit model:
# Get the names of all of the columns in the data
column_names <- names(x_skill_h2o)
# Remove the response variable from the list of column names
predictor_variables <- column_names[column_names != "churn"]
gbm.fit <- h2o.gbm(x = predictor_variables,
                   y = "churn",
                   training_frame = x_skill_h2o_train,
                   ntrees = 100,
                   learn_rate = 0.1,
                   max_depth = 5,
                   distribution = "bernoulli")
gbm.fit
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1672583265719_14701
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 61467 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 12 32 28.32000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.1047529
## RMSE: 0.3236555
## LogLoss: 0.3456253
## Mean Per-Class Error: 0.1766625
## AUC: 0.9206019
## AUCPR: 0.9477233
## Gini: 0.8412037
## R^2: 0.5421943
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 4984 2105 0.296939 =2105/7089
## 1 728 12183 0.056386 =728/12911
## Totals 5712 14288 0.141650 =2833/20000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.481337 0.895842 232
## 2 max f2 0.286516 0.939697 292
## 3 max f0point5 0.738911 0.890572 141
## 4 max accuracy 0.511068 0.858650 223
## 5 max precision 0.985792 1.000000 0
## 6 max recall 0.053208 1.000000 384
## 7 max specificity 0.985792 1.000000 0
## 8 max absolute_mcc 0.490417 0.685377 229
## 9 max min_per_class_accuracy 0.696010 0.836341 158
## 10 max mean_per_class_accuracy 0.650240 0.840662 175
## 11 max tns 0.985792 7089.000000 0
## 12 max fns 0.985792 12894.000000 0
## 13 max fps 0.013926 7089.000000 399
## 14 max tps 0.053208 12911.000000 384
## 15 max tnr 0.985792 1.000000 0
## 16 max fnr 0.985792 0.998683 0
## 17 max fpr 0.013926 1.000000 399
## 18 max tpr 0.053208 1.000000 384
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.varimp_plot(gbm.fit)
Accuracy:
h2o.performance(gbm.fit, x_skill_h2o_test)
## H2OBinomialMetrics: gbm
##
## MSE: 0.1418578
## RMSE: 0.3766401
## LogLoss: 0.4482721
## Mean Per-Class Error: 0.2511426
## AUC: 0.8405864
## AUCPR: 0.8780414
## Gini: 0.6811728
## R^2: 0.3763465
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1582 1217 0.434798 =1217/2799
## 1 351 4850 0.067487 =351/5201
## Totals 1933 6067 0.196000 =1568/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.418627 0.860845 257
## 2 max f2 0.170998 0.923776 333
## 3 max f0point5 0.657212 0.844406 174
## 4 max accuracy 0.515190 0.807375 226
## 5 max precision 0.966656 0.940678 6
## 6 max recall 0.031235 1.000000 393
## 7 max specificity 0.982469 0.999285 0
## 8 max absolute_mcc 0.543224 0.564413 217
## 9 max min_per_class_accuracy 0.713304 0.765631 152
## 10 max mean_per_class_accuracy 0.624384 0.776350 187
## 11 max tns 0.982469 2797.000000 0
## 12 max fns 0.982469 5196.000000 0
## 13 max fps 0.006840 2799.000000 399
## 14 max tps 0.031235 5201.000000 393
## 15 max tnr 0.982469 0.999285 0
## 16 max fnr 0.982469 0.999039 0
## 17 max fpr 0.006840 1.000000 399
## 18 max tpr 0.031235 1.000000 393
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Our accuracy increased very slightly, to 80.4%. Next, let’s try XGBoost:
xgb.fit <- h2o.xgboost(x = predictor_variables, y = "churn", training_frame = x_skill_h2o_train)
xgb.fit
## Model Details:
## ==============
##
## H2OBinomialModel: xgboost
## Model ID: XGBoost_model_R_1672583265719_14804
## Model Summary:
## number_of_trees
## 1 50
##
##
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
##
## MSE: 0.1080628
## RMSE: 0.3287291
## LogLoss: 0.3516744
## Mean Per-Class Error: 0.1852898
## AUC: 0.9183564
## AUCPR: 0.9494955
## Gini: 0.8367128
## R^2: 0.5277288
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 4910 2179 0.307378 =2179/7089
## 1 816 12095 0.063202 =816/12911
## Totals 5726 14274 0.149750 =2995/20000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.492955 0.889829 232
## 2 max f2 0.278149 0.935720 299
## 3 max f0point5 0.694774 0.890677 160
## 4 max accuracy 0.548336 0.851550 214
## 5 max precision 0.987350 1.000000 0
## 6 max recall 0.039872 1.000000 388
## 7 max specificity 0.987350 1.000000 0
## 8 max absolute_mcc 0.555534 0.670119 212
## 9 max min_per_class_accuracy 0.680782 0.836648 167
## 10 max mean_per_class_accuracy 0.670212 0.837841 171
## 11 max tns 0.987350 7089.000000 0
## 12 max fns 0.987350 12893.000000 0
## 13 max fps 0.007559 7089.000000 399
## 14 max tps 0.039872 12911.000000 388
## 15 max tnr 0.987350 1.000000 0
## 16 max fnr 0.987350 0.998606 0
## 17 max fpr 0.007559 1.000000 399
## 18 max tpr 0.039872 1.000000 388
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(xgb.fit, x_skill_h2o_test)
## H2OBinomialMetrics: xgboost
##
## MSE: 0.1412545
## RMSE: 0.3758384
## LogLoss: 0.4453738
## Mean Per-Class Error: 0.2459472
## AUC: 0.8455429
## AUCPR: 0.8866394
## Gini: 0.6910858
## R^2: 0.3789987
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1624 1175 0.419793 =1175/2799
## 1 375 4826 0.072102 =375/5201
## Totals 1999 6001 0.193750 =1550/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.431566 0.861632 256
## 2 max f2 0.202383 0.922765 325
## 3 max f0point5 0.686181 0.843017 164
## 4 max accuracy 0.524701 0.807500 224
## 5 max precision 0.987680 1.000000 0
## 6 max recall 0.029208 1.000000 393
## 7 max specificity 0.987680 1.000000 0
## 8 max absolute_mcc 0.524701 0.565122 224
## 9 max min_per_class_accuracy 0.708116 0.767352 155
## 10 max mean_per_class_accuracy 0.606980 0.775497 193
## 11 max tns 0.987680 2799.000000 0
## 12 max fns 0.987680 5194.000000 0
## 13 max fps 0.006029 2799.000000 399
## 14 max tps 0.029208 5201.000000 393
## 15 max tnr 0.987680 1.000000 0
## 16 max fnr 0.987680 0.998654 0
## 17 max fpr 0.006029 1.000000 399
## 18 max tpr 0.029208 1.000000 393
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1550/8000)
## [1] 0.80625
We get slightly better accuracy (80.6%, still below 81%). I will try XGBoost with the reduced data as well.
xgb.fit <- h2o.xgboost(x = predictor_variables, y = "churn", training_frame = x_skill_reduced_h2o_train)
xgb.fit
## Model Details:
## ==============
##
## H2OBinomialModel: xgboost
## Model ID: XGBoost_model_R_1672583265719_14852
## Model Summary:
## number_of_trees
## 1 50
##
##
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
##
## MSE: 0.09636587
## RMSE: 0.3104285
## LogLoss: 0.3194093
## Mean Per-Class Error: 0.1515705
## AUC: 0.9377671
## AUCPR: 0.9628305
## Gini: 0.8755342
## R^2: 0.5788485
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 5460 1629 0.229793 =1629/7089
## 1 947 11964 0.073348 =947/12911
## Totals 6407 13593 0.128800 =2576/20000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.551470 0.902807 213
## 2 max f2 0.283838 0.940270 296
## 3 max f0point5 0.710059 0.907574 154
## 4 max accuracy 0.578007 0.872050 204
## 5 max precision 0.992283 1.000000 0
## 6 max recall 0.035310 1.000000 387
## 7 max specificity 0.992283 1.000000 0
## 8 max absolute_mcc 0.589274 0.718040 200
## 9 max min_per_class_accuracy 0.674724 0.859887 168
## 10 max mean_per_class_accuracy 0.634102 0.861320 185
## 11 max tns 0.992283 7089.000000 0
## 12 max fns 0.992283 12899.000000 0
## 13 max fps 0.006358 7089.000000 399
## 14 max tps 0.035310 12911.000000 387
## 15 max tnr 0.992283 1.000000 0
## 16 max fnr 0.992283 0.999071 0
## 17 max fpr 0.006358 1.000000 399
## 18 max tpr 0.035310 1.000000 387
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.performance(xgb.fit, x_skill_reduced_h2o_test)
## H2OBinomialMetrics: xgboost
##
## MSE: 0.1438475
## RMSE: 0.3792723
## LogLoss: 0.453158
## Mean Per-Class Error: 0.2595262
## AUC: 0.8400971
## AUCPR: 0.8815367
## Gini: 0.6801941
## R^2: 0.3675991
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1520 1279 0.456949 =1279/2799
## 1 323 4878 0.062103 =323/5201
## Totals 1843 6157 0.200250 =1602/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.374642 0.858954 269
## 2 max f2 0.125937 0.923370 346
## 3 max f0point5 0.647444 0.842209 183
## 4 max accuracy 0.453353 0.802750 245
## 5 max precision 0.990723 1.000000 0
## 6 max recall 0.018043 1.000000 395
## 7 max specificity 0.990723 1.000000 0
## 8 max absolute_mcc 0.466034 0.551566 241
## 9 max min_per_class_accuracy 0.717769 0.763487 155
## 10 max mean_per_class_accuracy 0.647444 0.773594 183
## 11 max tns 0.990723 2799.000000 0
## 12 max fns 0.990723 5190.000000 0
## 13 max fps 0.002912 2799.000000 399
## 14 max tps 0.018043 5201.000000 395
## 15 max tnr 0.990723 1.000000 0
## 16 max fnr 0.990723 0.997885 0
## 17 max fpr 0.002912 1.000000 399
## 18 max tpr 0.018043 1.000000 395
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
We got slightly worse accuracy (about 80.0%).
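Both GBM and XGBoost above were fit with mostly default settings; a small random grid search is one way to check whether tuning helps. A hedged sketch — the hyper_params values are illustrative, not tuned recommendations:

```r
# Sketch: random grid over a few GBM hyperparameters, ranked by AUC.
hyper_params <- list(max_depth  = c(3, 5, 7),
                     learn_rate = c(0.05, 0.1),
                     ntrees     = c(100, 200))
grid <- h2o.grid("gbm",
                 x = predictor_variables, y = "churn",
                 training_frame = x_skill_h2o_train,
                 hyper_params = hyper_params,
                 search_criteria = list(strategy = "RandomDiscrete",
                                        max_models = 10, seed = 1))
h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)
```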
Finally, I will try AutoML.
automl.fit <- h2o.automl(
  x = predictor_variables,
  y = "churn",
  training_frame = x_skill_h2o_train,
  max_models = 10,
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
## model_id auc logloss
## 1 StackedEnsemble_AllModels_1_AutoML_9_20230101_182207 0.8575123 0.4301484
## 2 StackedEnsemble_BestOfFamily_1_AutoML_9_20230101_182207 0.8563241 0.4322788
## 3 XGBoost_3_AutoML_9_20230101_182207 0.8522842 0.4375503
## 4 GBM_2_AutoML_9_20230101_182207 0.8502435 0.4407005
## 5 GBM_1_AutoML_9_20230101_182207 0.8486417 0.4419708
## 6 GBM_3_AutoML_9_20230101_182207 0.8466284 0.4441636
## 7 GBM_4_AutoML_9_20230101_182207 0.8424572 0.4522268
## 8 XGBoost_2_AutoML_9_20230101_182207 0.8408336 0.4619994
## 9 XGBoost_1_AutoML_9_20230101_182207 0.8349079 0.4713581
## 10 XRT_1_AutoML_9_20230101_182207 0.8310142 0.4821717
## 11 DRF_1_AutoML_9_20230101_182207 0.8171732 0.4953259
## 12 GLM_1_AutoML_9_20230101_182207 0.7730249 0.5436575
## aucpr mean_per_class_error rmse mse
## 1 0.8958364 0.2449691 0.3686387 0.1358945
## 2 0.8944346 0.2410902 0.3696474 0.1366392
## 3 0.8907117 0.2401100 0.3720752 0.1384400
## 4 0.8893884 0.2345314 0.3735160 0.1395142
## 5 0.8879666 0.2392166 0.3740593 0.1399203
## 6 0.8855012 0.2459748 0.3748629 0.1405222
## 7 0.8832669 0.2406388 0.3788148 0.1435006
## 8 0.8811776 0.2451194 0.3813694 0.1454426
## 9 0.8766028 0.2617947 0.3853821 0.1485194
## 10 0.8763051 0.2617249 0.3937005 0.1550000
## 11 0.8697120 0.2750101 0.4009410 0.1607536
## 12 0.8392896 0.3535085 0.4242718 0.1800065
##
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_skill_h2o_test)
## H2OBinomialMetrics: stackedensemble
##
## MSE: 0.1394035
## RMSE: 0.3733678
## LogLoss: 0.4401468
## Mean Per-Class Error: 0.2513921
## AUC: 0.8472375
## AUCPR: 0.8869587
## Gini: 0.6944751
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1558 1241 0.443373 =1241/2799
## 1 309 4892 0.059412 =309/5201
## Totals 1867 6133 0.193750 =1550/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.393993 0.863243 267
## 2 max f2 0.179430 0.924760 332
## 3 max f0point5 0.655320 0.847608 177
## 4 max accuracy 0.509947 0.809750 229
## 5 max precision 0.965852 0.969697 4
## 6 max recall 0.045866 1.000000 390
## 7 max specificity 0.977023 0.999643 0
## 8 max absolute_mcc 0.509947 0.569162 229
## 9 max min_per_class_accuracy 0.710279 0.769852 155
## 10 max mean_per_class_accuracy 0.649793 0.780958 179
## 11 max tns 0.977023 2798.000000 0
## 12 max fns 0.977023 5196.000000 0
## 13 max fps 0.012676 2799.000000 399
## 14 max tps 0.045866 5201.000000 390
## 15 max tnr 0.977023 0.999643 0
## 16 max fnr 0.977023 0.999039 0
## 17 max fpr 0.012676 1.000000 399
## 18 max tpr 0.045866 1.000000 390
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1550/8000)
## [1] 0.80625
We get 80.6% accuracy.
Next, I will fit the same AutoML setup to the reduced data set.
automl.fit <- h2o.automl(
  x = predictor_variables,
  y = "churn",
  training_frame = x_skill_reduced_h2o_train,
  max_models = 10,
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
## model_id auc logloss
## 1 StackedEnsemble_AllModels_1_AutoML_10_20230101_182434 0.8543237 0.4337217
## 2 StackedEnsemble_BestOfFamily_1_AutoML_10_20230101_182434 0.8534913 0.4351727
## 3 XGBoost_3_AutoML_10_20230101_182434 0.8499165 0.4399922
## 4 GBM_2_AutoML_10_20230101_182434 0.8477287 0.4434295
## 5 GBM_1_AutoML_10_20230101_182434 0.8462881 0.4448453
## 6 GBM_3_AutoML_10_20230101_182434 0.8440609 0.4476523
## 7 GBM_4_AutoML_10_20230101_182434 0.8408652 0.4537178
## 8 XGBoost_2_AutoML_10_20230101_182434 0.8369381 0.4676053
## 9 XGBoost_1_AutoML_10_20230101_182434 0.8345407 0.4720734
## 10 XRT_1_AutoML_10_20230101_182434 0.8298986 0.4834764
## 11 DRF_1_AutoML_10_20230101_182434 0.8188427 0.4936558
## 12 GLM_1_AutoML_10_20230101_182434 0.7695699 0.5463848
## aucpr mean_per_class_error rmse mse
## 1 0.8932174 0.2427178 0.3701304 0.1369965
## 2 0.8922663 0.2483295 0.3707452 0.1374520
## 3 0.8889088 0.2468598 0.3727142 0.1389159
## 4 0.8876163 0.2490767 0.3748870 0.1405403
## 5 0.8868204 0.2437526 0.3756440 0.1411084
## 6 0.8837560 0.2619959 0.3767716 0.1419568
## 7 0.8807047 0.2553473 0.3795248 0.1440390
## 8 0.8780063 0.2678502 0.3839408 0.1474106
## 9 0.8765047 0.2556476 0.3856724 0.1487432
## 10 0.8761837 0.2670490 0.3946381 0.1557392
## 11 0.8713483 0.2873012 0.4003316 0.1602654
## 12 0.8369895 0.3532583 0.4250694 0.1806840
##
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_skill_reduced_h2o_test)
## H2OBinomialMetrics: stackedensemble
##
## MSE: 0.1412805
## RMSE: 0.375873
## LogLoss: 0.4458455
## Mean Per-Class Error: 0.2447376
## AUC: 0.8428508
## AUCPR: 0.8818931
## Gini: 0.6857017
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1634 1165 0.416220 =1165/2799
## 1 381 4820 0.073255 =381/5201
## Totals 2015 5985 0.193250 =1546/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.443758 0.861792 249
## 2 max f2 0.144687 0.923757 344
## 3 max f0point5 0.647552 0.843305 179
## 4 max accuracy 0.457275 0.807500 244
## 5 max precision 0.978902 1.000000 0
## 6 max recall 0.038369 1.000000 394
## 7 max specificity 0.978902 1.000000 0
## 8 max absolute_mcc 0.457275 0.562705 244
## 9 max min_per_class_accuracy 0.707354 0.765988 154
## 10 max mean_per_class_accuracy 0.639264 0.775529 182
## 11 max tns 0.978902 2799.000000 0
## 12 max fns 0.978902 5198.000000 0
## 13 max fps 0.018707 2799.000000 399
## 14 max tps 0.038369 5201.000000 394
## 15 max tnr 0.978902 1.000000 0
## 16 max fnr 0.978902 0.999423 0
## 17 max fpr 0.018707 1.000000 399
## 18 max tpr 0.038369 1.000000 394
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1546/8000)
## [1] 0.80675
We get 80.7% accuracy.
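With accuracies this close between AutoML runs, secondary criteria can help pick a winner; `h2o.get_leaderboard` can add columns such as training time and per-row prediction speed. A sketch (the `lb_full` name is ours):

```r
# Sketch: extended leaderboard with timing columns, useful as a
# tie-breaker when AUC/accuracy are nearly identical across models.
lb_full <- h2o.get_leaderboard(automl.fit, extra_columns = "ALL")
head(lb_full)
```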
Next, I will try without the skill feature:
x_h2o <- as.h2o(x)
str(x_h2o)
## Class 'H2OFrame' <environment: 0x7fe1fd913c90>
## - attr(*, "op")= chr "Parse"
## - attr(*, "id")= chr "x_sid_af00_243"
## - attr(*, "eval")= logi FALSE
## - attr(*, "nrow")= int 30000
## - attr(*, "ncol")= int 23
## - attr(*, "types")=List of 23
## ..$ : chr "int"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "int"
## ..$ : chr "int"
## - attr(*, "data")='data.frame': 10 obs. of 23 variables:
## ..$ X_id : num 88464659 88466479 88468912 88471351 88473161 ...
## ..$ country : Factor w/ 71 levels "AE","AL","AM",..: 38 10 58 40 58 10 26 58 30 51
## ..$ device_brand : Factor w/ 113 levels "","A-gold","A1",..: 108 90 45 45 45 90 90 108 108 108
## ..$ device_model : Factor w/ 1720 levels "","10C_LTE","1AZ2P",..: 960 875 168 1256 521 610 887 982 975 1005
## ..$ session_cnt : num 7 1 1 2 1 3 1 3 1 8
## ..$ session_length : num 13162 1809 2141 1298 666 ...
## ..$ lang : Factor w/ 28 levels "AZ","BR","BU",..: 15 2 22 16 22 2 10 22 11 18
## ..$ max_lvl_no : num 13 9 28 14 12 20 9 23 9 1
## ..$ gameplay_duration: num 1088 1898 1685 1221 532 ...
## ..$ bonus_cnt : num 6 9 8 2 4 23 3 3 0 15
## ..$ hint1_cnt : num 0 0 0 0 0 0 0 0 0 0
## ..$ hint2_cnt : num 2 1 0 0 1 0 0 0 0 0
## ..$ hint3_cnt : num 0 0 0 0 0 0 0 0 0 2
## ..$ repeat_cnt : num 7 10 21 7 10 30 4 1 1 38
## ..$ gold_cnt : num 418 25 990 1360 1367 ...
## ..$ banner_cnt : num 17 38 0 49 23 154 16 33 9 14
## ..$ is_cnt : num 1 0 16 2 0 7 0 11 0 2
## ..$ rv_cnt : num 0 1 1 1 1 3 0 1 1 4
## ..$ churn : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
## ..$ multi_lang : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1
## ..$ max_ses_length : num 12451 1809 2141 1248 666 ...
## ..$ median_ses_length: num 100 1809 2141 649 666 ...
## ..$ avg_ses_length : num 1880 1809 2141 649 666 ...
x_h2o_train <- x_h2o[1:20000, ]
x_h2o_test <- x_h2o[20001:28000, ]
# Get the names of all of the columns in the data
column_names <- names(x_h2o)
# Remove the response variable from the list of column names
predictor_variables <- column_names[column_names != "churn"]
automl.fit <- h2o.automl(
  x = predictor_variables,
  y = "churn",
  training_frame = x_h2o_train,
  max_models = 10,
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
## model_id auc logloss
## 1 StackedEnsemble_AllModels_1_AutoML_11_20230101_182703 0.8629645 0.4307408
## 2 StackedEnsemble_BestOfFamily_1_AutoML_11_20230101_182703 0.8618528 0.4326387
## 3 GBM_1_AutoML_11_20230101_182703 0.8568192 0.4397327
## 4 XGBoost_3_AutoML_11_20230101_182703 0.8567609 0.4407010
## 5 GBM_2_AutoML_11_20230101_182703 0.8563977 0.4399842
## 6 GBM_3_AutoML_11_20230101_182703 0.8543046 0.4431551
## 7 GBM_4_AutoML_11_20230101_182703 0.8505075 0.4487417
## 8 XGBoost_2_AutoML_11_20230101_182703 0.8448344 0.4654049
## 9 XGBoost_1_AutoML_11_20230101_182703 0.8430556 0.4691345
## 10 XRT_1_AutoML_11_20230101_182703 0.8396969 0.4845954
## 11 DRF_1_AutoML_11_20230101_182703 0.8288493 0.4934903
## 12 GLM_1_AutoML_11_20230101_182703 0.7678991 0.5685092
## aucpr mean_per_class_error rmse mse
## 1 0.8932263 0.2312695 0.3691162 0.1362468
## 2 0.8924886 0.2313376 0.3701826 0.1370351
## 3 0.8871571 0.2417333 0.3733418 0.1393841
## 4 0.8870988 0.2327261 0.3737981 0.1397250
## 5 0.8873870 0.2327423 0.3734084 0.1394339
## 6 0.8835169 0.2333443 0.3747803 0.1404603
## 7 0.8818336 0.2455255 0.3776878 0.1426481
## 8 0.8771850 0.2380139 0.3833282 0.1469405
## 9 0.8737188 0.2385371 0.3842626 0.1476577
## 10 0.8752382 0.2507112 0.3950898 0.1560960
## 11 0.8717912 0.2722333 0.4005574 0.1604462
## 12 0.8236295 0.3429454 0.4321508 0.1867543
##
## [12 rows x 7 columns]
h2o.performance(aml@leader, x_h2o_test)
## H2OBinomialMetrics: stackedensemble
##
## MSE: 0.1391196
## RMSE: 0.3729874
## LogLoss: 0.4390133
## Mean Per-Class Error: 0.2362447
## AUC: 0.8523114
## AUCPR: 0.8862735
## Gini: 0.7046229
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 1737 1151 0.398546 =1151/2888
## 1 378 4734 0.073944 =378/5112
## Totals 2115 5885 0.191125 =1529/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.445769 0.860962 244
## 2 max f2 0.228322 0.921725 311
## 3 max f0point5 0.642056 0.843388 176
## 4 max accuracy 0.519997 0.809250 221
## 5 max precision 0.982340 1.000000 0
## 6 max recall 0.040029 1.000000 390
## 7 max specificity 0.982340 1.000000 0
## 8 max absolute_mcc 0.519997 0.575918 221
## 9 max min_per_class_accuracy 0.698417 0.775277 154
## 10 max mean_per_class_accuracy 0.632823 0.782535 179
## 11 max tns 0.982340 2888.000000 0
## 12 max fns 0.982340 5111.000000 0
## 13 max fps 0.014151 2888.000000 399
## 14 max tps 0.040029 5112.000000 390
## 15 max tnr 0.982340 1.000000 0
## 16 max fnr 0.982340 0.999804 0
## 17 max fpr 0.014151 1.000000 399
## 18 max tpr 0.040029 1.000000 390
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
1-(1529/8000)
## [1] 0.808875
Our accuracy is almost the same, but slightly higher than the version with the skill feature (80.9% vs 80.6%).
I also tried increasing max_models, but the accuracy did not improve (possibly overfitting from stacking too many models).
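One more hedged variant worth trying: ranking the AutoML models on a dedicated held-out frame instead of cross-validation metrics, so the leaderboard ordering better matches held-out accuracy. In the sketch below, `x_h2o_valid` is a hypothetical third split built from the rows left unused by the earlier 1:20000 / 20001:28000 split:

```r
# Sketch: use the 2000 unused rows as a leaderboard frame so AutoML
# ranks candidates on data seen by neither training nor our test split.
x_h2o_valid <- x_h2o[28001:30000, ]
automl.fit2 <- h2o.automl(
  x = predictor_variables,
  y = "churn",
  training_frame = x_h2o_train,
  leaderboard_frame = x_h2o_valid,
  max_models = 10,
  seed = 1)
```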
lg <- glm(churn~., data = x_skill_reduced_train, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg)
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = x_skill_reduced_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7079 -0.9561 0.5628 0.8214 5.1098
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.707e+01 1.023e+00 26.461 < 2e-16 ***
## X_id -2.318e-07 8.386e-09 -27.639 < 2e-16 ***
## countryBR -1.983e+00 5.931e-01 -3.343 0.000828 ***
## countryCZ 6.164e-01 6.416e-01 0.961 0.336738
## countryFR -2.568e-01 4.686e-01 -0.548 0.583714
## countryGR -1.146e+00 5.650e-01 -2.028 0.042604 *
## countryHU -3.638e-02 4.419e-01 -0.082 0.934390
## countryIT -8.162e-01 4.668e-01 -1.749 0.080367 .
## countryLT -5.211e-01 5.313e-01 -0.981 0.326698
## countryMX 8.100e-02 3.738e-01 0.217 0.828457
## countryNL -8.973e-01 3.922e-01 -2.288 0.022158 *
## countryOther -6.750e-01 3.366e-01 -2.005 0.044918 *
## countryPL -2.976e-01 4.765e-01 -0.624 0.532363
## countryRO -2.857e-01 3.901e-01 -0.732 0.463945
## countrySE -2.123e-01 5.180e-01 -0.410 0.681833
## countrySK -9.686e-02 4.794e-01 -0.202 0.839902
## countryTR -5.387e-01 5.107e-01 -1.055 0.291424
## device_brandmotorola -2.248e-01 1.040e-01 -2.161 0.030723 *
## device_brandOPPO -7.638e-01 1.076e-01 -7.101 1.24e-12 ***
## device_brandOther -3.625e-01 7.996e-02 -4.533 5.81e-06 ***
## device_brandsamsung -3.732e-01 6.229e-02 -5.991 2.08e-09 ***
## device_brandXiaomi -4.684e-01 7.008e-02 -6.683 2.34e-11 ***
## device_modelANE-LX1 -1.816e+00 3.270e-01 -5.556 2.77e-08 ***
## device_modelM2003J15SC -1.529e+00 3.159e-01 -4.840 1.30e-06 ***
## device_modelM2004J19C -1.504e+00 3.276e-01 -4.591 4.42e-06 ***
## device_modelMAR-LX1A -2.078e+00 3.095e-01 -6.714 1.89e-11 ***
## device_modelOther -1.857e+00 2.872e-01 -6.467 1.00e-10 ***
## device_modelPOT-LX1 -1.948e+00 3.252e-01 -5.991 2.08e-09 ***
## device_modelRedmi Note 7 -1.459e+00 3.465e-01 -4.212 2.53e-05 ***
## device_modelRedmi Note 8 -1.358e+00 3.536e-01 -3.841 0.000122 ***
## device_modelRedmi Note 8 Pro -1.575e+00 3.208e-01 -4.909 9.13e-07 ***
## device_modelRedmi Note 8T -1.233e+00 3.406e-01 -3.619 0.000296 ***
## device_modelRedmi Note 9 Pro -1.435e+00 3.168e-01 -4.529 5.93e-06 ***
## device_modelSM-A105FN -1.415e+00 3.314e-01 -4.271 1.95e-05 ***
## device_modelSM-A125F -2.188e+00 3.255e-01 -6.723 1.78e-11 ***
## device_modelSM-A127F -3.390e+00 3.499e-01 -9.689 < 2e-16 ***
## device_modelSM-A202F -1.536e+00 3.171e-01 -4.845 1.27e-06 ***
## device_modelSM-A217F -1.603e+00 3.147e-01 -5.093 3.52e-07 ***
## device_modelSM-A405FN -1.560e+00 3.405e-01 -4.580 4.64e-06 ***
## device_modelSM-A505FN -1.538e+00 3.191e-01 -4.818 1.45e-06 ***
## device_modelSM-A515F -1.593e+00 3.056e-01 -5.213 1.85e-07 ***
## device_modelSM-A528B -3.253e+00 3.506e-01 -9.278 < 2e-16 ***
## device_modelSM-A705FN -1.463e+00 3.404e-01 -4.298 1.72e-05 ***
## device_modelSM-A715F -1.765e+00 3.128e-01 -5.643 1.67e-08 ***
## device_modelSM-G950F -1.474e+00 3.396e-01 -4.339 1.43e-05 ***
## device_modelSM-G960F -1.396e+00 3.394e-01 -4.114 3.89e-05 ***
## device_modelSM-G965F -1.265e+00 3.519e-01 -3.595 0.000324 ***
## device_modelSM-G973F -1.480e+00 3.275e-01 -4.518 6.23e-06 ***
## device_modelSM-G975F -1.463e+00 3.523e-01 -4.152 3.30e-05 ***
## device_modelSM-G991B -2.446e+00 3.303e-01 -7.406 1.30e-13 ***
## device_modelSNE-LX1 -2.226e+00 3.371e-01 -6.603 4.03e-11 ***
## device_modelVOG-L29 -2.076e+00 3.265e-01 -6.360 2.02e-10 ***
## session_cnt 5.607e-03 2.605e-03 2.152 0.031397 *
## session_length -2.117e-06 9.894e-07 -2.140 0.032376 *
## langBU -2.510e+00 5.945e-01 -4.221 2.43e-05 ***
## langCS -3.170e+00 7.258e-01 -4.367 1.26e-05 ***
## langES -2.282e+00 5.065e-01 -4.506 6.62e-06 ***
## langFR -2.469e+00 5.985e-01 -4.125 3.71e-05 ***
## langGR -1.079e+00 6.712e-01 -1.607 0.108054
## langHU -2.424e+00 5.806e-01 -4.175 2.98e-05 ***
## langIT -1.417e+00 5.871e-01 -2.413 0.015828 *
## langLT -2.034e+00 6.382e-01 -3.187 0.001439 **
## langNL -1.715e+00 5.482e-01 -3.129 0.001754 **
## langOther -2.171e+00 4.952e-01 -4.384 1.17e-05 ***
## langPL -2.222e+00 6.039e-01 -3.680 0.000233 ***
## langRO -1.745e+00 5.335e-01 -3.271 0.001073 **
## langRU -2.040e+00 5.235e-01 -3.897 9.75e-05 ***
## langSE -2.090e+00 6.200e-01 -3.371 0.000750 ***
## langSK -2.260e+00 5.997e-01 -3.769 0.000164 ***
## langTR -1.726e+00 6.487e-01 -2.662 0.007776 **
## max_lvl_no -1.551e-02 1.839e-03 -8.438 < 2e-16 ***
## gameplay_duration -2.592e-06 1.335e-06 -1.942 0.052088 .
## bonus_cnt -5.744e-04 1.348e-04 -4.262 2.02e-05 ***
## hint1_cnt -1.998e-03 1.576e-03 -1.268 0.204871
## hint2_cnt 3.471e-02 5.291e-03 6.560 5.38e-11 ***
## hint3_cnt 3.809e-04 5.006e-04 0.761 0.446657
## repeat_cnt 1.887e-05 1.753e-04 0.108 0.914287
## gold_cnt 1.954e-04 3.964e-05 4.928 8.30e-07 ***
## banner_cnt -2.127e-04 9.657e-05 -2.202 0.027644 *
## is_cnt 5.023e-03 8.359e-04 6.009 1.87e-09 ***
## rv_cnt -5.664e-04 4.715e-04 -1.201 0.229651
## multi_langTRUE 1.244e+00 1.613e-01 7.715 1.21e-14 ***
## max_ses_length 7.629e-06 4.853e-06 1.572 0.115916
## median_ses_length -3.627e-04 3.349e-05 -10.830 < 2e-16 ***
## avg_ses_length 1.502e-05 2.867e-05 0.524 0.600337
## skill -1.120e+00 5.294e-02 -21.158 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26006 on 19999 degrees of freedom
## Residual deviance: 21602 on 19914 degrees of freedom
## AIC: 21774
##
## Number of Fisher Scoring iterations: 6
lg.probs <- predict(lg, x_skill_reduced_test, type = "response")
lg.pred <- rep("0", 8000)
lg.pred[lg.probs > .55] <- "1"
act_churn <- x_skill_reduced_test$churn
table(lg.pred,act_churn)
## act_churn
## lg.pred 0 1
## 0 1481 866
## 1 1318 4335
mean(lg.pred==act_churn)
## [1] 0.727
We got 72.7% accuracy with logistic regression, which is noticeably worse than the other models.
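Accuracy alone can hide how the errors split across the two classes. A quick sketch deriving precision and recall for the churn class, with the counts copied from the confusion matrix above:

```r
# Counts from the confusion matrix above (rows: lg.pred, columns: act_churn),
# at the 0.55 probability cutoff.
tn <- 1481; fn <- 866   # predicted "0"
fp <- 1318; tp <- 4335  # predicted "1"

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)   # how often a predicted churner actually churned
recall    <- tp / (tp + fn)   # share of actual churners we caught
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
##  accuracy precision    recall
##     0.727     0.767     0.833
```

So at this cutoff the model catches about 83% of the churners, but roughly a quarter of its churn predictions are false alarms.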
Out of all the models, AutoML did best. The question now is which data set to use: the reduced-category version, the skill-added version, or the original format. For two reasons, I will simply fit AutoML without reducing the categories or adding skill, and will train the final model on the whole data set.
automl.fit <- h2o.automl(
  x = predictor_variables,
  y = "churn",
  training_frame = x_h2o,
  max_models = 10,
  seed = 1)
aml <- automl.fit
lb <- aml@leaderboard
print(lb, n = nrow(lb))
## model_id auc logloss
## 1 StackedEnsemble_AllModels_1_AutoML_12_20230101_182943 0.8625573 0.4300385
## 2 StackedEnsemble_BestOfFamily_1_AutoML_12_20230101_182943 0.8607965 0.4326567
## 3 XGBoost_3_AutoML_12_20230101_182943 0.8577541 0.4379029
## 4 GBM_2_AutoML_12_20230101_182943 0.8546303 0.4410855
## 5 GBM_3_AutoML_12_20230101_182943 0.8527280 0.4438705
## 6 GBM_1_AutoML_12_20230101_182943 0.8518206 0.4455442
## 7 GBM_4_AutoML_12_20230101_182943 0.8504611 0.4477921
## 8 XGBoost_2_AutoML_12_20230101_182943 0.8479007 0.4553089
## 9 XGBoost_1_AutoML_12_20230101_182943 0.8434481 0.4652587
## 10 XRT_1_AutoML_12_20230101_182943 0.8357179 0.4891230
## 11 DRF_1_AutoML_12_20230101_182943 0.8280517 0.4941360
## 12 GLM_1_AutoML_12_20230101_182943 0.7663673 0.5700323
## aucpr mean_per_class_error rmse mse
## 1 0.8926119 0.2306650 0.3689101 0.1360947
## 2 0.8907627 0.2213627 0.3701911 0.1370415
## 3 0.8883397 0.2392026 0.3727654 0.1389540
## 4 0.8837753 0.2318905 0.3739344 0.1398269
## 5 0.8816412 0.2361353 0.3752812 0.1408360
## 6 0.8819648 0.2422024 0.3760066 0.1413809
## 7 0.8802327 0.2444602 0.3771674 0.1422553
## 8 0.8799136 0.2428036 0.3799339 0.1443498
## 9 0.8767524 0.2459889 0.3839084 0.1473857
## 10 0.8701438 0.2639150 0.3972979 0.1578456
## 11 0.8679375 0.2732919 0.4004749 0.1603801
## 12 0.8213509 0.3560844 0.4322389 0.1868305
##
## [12 rows x 7 columns]
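For binary classification h2o sorts this leaderboard by cross-validated AUC by default. If you also want per-model metadata such as training time, the h2o R API offers `h2o.get_leaderboard`; a sketch using the `aml` object from above:

```r
# Leaderboard with extra columns (e.g. training_time_ms, algo).
lb_full <- h2o.get_leaderboard(aml, extra_columns = "ALL")
print(lb_full, n = nrow(lb_full))
```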
h2o.performance(aml@leader, x_h2o_test)
## H2OBinomialMetrics: stackedensemble
##
## MSE: 0.09500837
## RMSE: 0.3082343
## LogLoss: 0.3185747
## Mean Per-Class Error: 0.147702
## AUC: 0.9443401
## AUCPR: 0.9668835
## Gini: 0.8886801
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 2223 665 0.230263 =665/2888
## 1 333 4779 0.065141 =333/5112
## Totals 2556 5444 0.124750 =998/8000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.536138 0.905457 212
## 2 max f2 0.274114 0.939168 296
## 3 max f0point5 0.726824 0.915289 142
## 4 max accuracy 0.573789 0.876375 200
## 5 max precision 0.978134 1.000000 0
## 6 max recall 0.096473 1.000000 359
## 7 max specificity 0.978134 1.000000 0
## 8 max absolute_mcc 0.599529 0.729772 192
## 9 max min_per_class_accuracy 0.665707 0.865997 167
## 10 max mean_per_class_accuracy 0.672841 0.867920 164
## 11 max tns 0.978134 2888.000000 0
## 12 max fns 0.978134 5110.000000 0
## 13 max fps 0.016442 2888.000000 399
## 14 max tps 0.096473 5112.000000 359
## 15 max tnr 0.978134 1.000000 0
## 16 max fnr 0.978134 0.999609 0
## 17 max fpr 0.016442 1.000000 399
## 18 max tpr 0.096473 1.000000 359
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
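For a direct comparison with the logistic regression above, accuracy can be read off this confusion matrix (built at the F1-optimal threshold of about 0.536):

```r
# Counts from the h2o confusion matrix above (vertical: actual, across: predicted).
tn <- 2223; fp <- 665    # actual 0
fn <- 333;  tp <- 4779   # actual 1

(tp + tn) / (tp + tn + fp + fn)
## [1] 0.87525
```

So the AutoML leader reaches roughly 87.5% accuracy on the held-out 8000 rows, versus 72.7% for the logistic regression.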
x_test_data <- read.csv("ml_project_2022_churn_test.csv")
x_test_h2o <- as.h2o(x_test_data)
predictions <- predict(aml@leader, x_test_h2o)
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'country' has levels not trained on: ["BS", "DO"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'device_brand' has levels not trained on: ["Archos", "CASPER",
## "Energizer", "Essentielb", "FIGI", "Inco", "Kalley", "Philco", "TECLAST",
## "Tablet_Express", "Teclast", "along", "revoview"]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Test/Validation
## dataset column 'device_model' has levels not trained on: ["2201117TG",
## "5029Y_EEA", "8084_EEA", "8088X_EEA", "A1_A1 Alpha", "ASUS_X008DB",
## "ASUS_X00QDA", "ASUS_Z017D", "ASUS_Z01KD", "AUM-AL20", ...124 not listed...,
## "samsung_SM-A315F", "samsung_SM-A405FN", "samsung_SM-A505G", "samsung_SM-A605F",
## "samsung_SM-G610F", "samsung_SM-T290", "vivo 1814", "vivo 1815", "vivo vivo
## 1904", "vivo+1915"]
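The warnings above list categorical levels that appear only in the test set; h2o still scores those rows rather than erroring. The mismatch can be inspected directly with `setdiff`. A toy illustration with hypothetical country vectors chosen to mirror the warning:

```r
# Levels present in the test set but absent from training.
train_countries <- c("IT", "BR", "RO", "LT", "TR", "BG")
test_countries  <- c("TR", "BG", "BS", "DO")
setdiff(test_countries, train_countries)
## [1] "BS" "DO"
```

On the real frames the same check is `setdiff(unique(x_test_data$country), unique(x$country))`, assuming the training frame `x` loaded at the start of the notebook is still in memory.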
predictions
## predict p0 p1
## 1 0 0.94916405 0.05083595
## 2 1 0.14228735 0.85771265
## 3 1 0.22597792 0.77402208
## 4 1 0.04052658 0.95947342
## 5 1 0.11206460 0.88793540
## 6 1 0.10628928 0.89371072
##
## [5000 rows x 3 columns]
predictions$predict
## predict
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
##
## [5000 rows x 1 column]
churn_answers <- predictions$predict
churn_answers <- as.data.frame(churn_answers)
final_predicted <- cbind(x_test_data, churn_answers)
str(final_predicted)
## 'data.frame': 5000 obs. of 26 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ X_id : int 95655514 95654999 95654244 95653816 95653698 95653219 95652826 95652548 95652475 95652184 ...
## $ os : chr "android" "android" "android" "android" ...
## $ country : chr "TR" "BG" "BG" "GR" ...
## $ device_brand : chr "samsung" "samsung" "TCL" "samsung" ...
## $ device_model : chr "SM-A515F" "SM-A715F" "5024D_EEA" "SM-A600FN" ...
## $ session_cnt : int 1 4 7 397 3 5 1 6 1 1 ...
## $ session_length : int 1448 27120 1039 235357 514 2716 982 2278 501 1028 ...
## $ lang : chr "TR" "BU" "BU" "GR" ...
## $ max_lvl_no : int 25 13 11 2 2 12 17 23 7 21 ...
## $ gameplay_duration: int 1190 2525 641 112759 352 2646 689 1936 475 777 ...
## $ bonus_cnt : int 16 4 0 85 6 16 4 6 2 4 ...
## $ hint1_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hint2_cnt : int 0 6 0 11 0 6 5 0 3 1 ...
## $ hint3_cnt : int 0 0 0 1278 3 0 0 0 0 0 ...
## $ repeat_cnt : int 17 19 5 576 5 21 3 7 1 7 ...
## $ gold_cnt : int 581 971 1270 515 515 60 1141 886 100 814 ...
## $ claim_gold_cnt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ banner_cnt : int 47 105 19 2793 10 60 28 34 7 32 ...
## $ is_cnt : int 8 14 0 260 0 0 5 8 0 4 ...
## $ rv_cnt : int 1 2 0 1237 0 22 6 0 1 1 ...
## $ multi_lang : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ max_ses_length : int 1448 24966 478 24236 371 923 982 1469 501 1028 ...
## $ median_ses_length: int 1448 899 90 225 108 407 982 74 501 1028 ...
## $ avg_ses_length : int 1448 6780 148 592 171 543 982 379 501 1028 ...
## $ predict : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 2 1 1 ...
write.csv(final_predicted, "churn-analysis-notebook_files/churn_predictions.csv")
We predicted churn for the 5000 test users, appended the predictions to the test data as a new column named "predict", and wrote the result to churn_predictions.csv.
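If downstream consumers only need the id-to-label mapping, a slimmer file can be written as well (a sketch; the file name is arbitrary, and `row.names = FALSE` suppresses the extra index column that `write.csv` adds by default):

```r
# Keep only the user id and the predicted churn label.
submission <- final_predicted[, c("X_id", "predict")]
write.csv(submission,
          "churn-analysis-notebook_files/churn_submission.csv",
          row.names = FALSE)
```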